Preface



The Sixth International Symposium on Spatial Databases (SSD’99) was held in 
Hong Kong, China, July 20-23, 1999. This is a “ten-year anniversary” edition. 
The series of conferences started in Santa Barbara in 1989 and continued
biennially in Zurich (1991), Singapore (1993), Portland, Maine (1995), and Berlin
(1997). We are very pleased that on this occasion Oliver Günther, one of the
initiators of the conference in 1989, agreed to give an “anniversary talk” in 
which he presented his view of SSD in the past as well as in the future ten years. 

SSD is well established as the premier international conference devoted to 
the handling of spatial data in databases. The number of submissions has been 
stable in recent years; in 1999 there were 55 research submissions, which
is exactly the same number as for the last conference in Berlin. Out of these, the 
program committee accepted 17 excellent papers for presentation. In addition to 
the “anniversary talk”, the technical program contained two keynote presenta- 
tions (Christos Papadimitriou, Timos Sellis), four tutorials (Markus Schneider, 
Leila De Floriani and Enrico Puppo, Jayant Sharma, and Mike Freeston), and 
one panel. 

The papers included in these proceedings reflect some of the current trends 
in spatial databases. Classical topics such as spatial indexing or spatial join 
continue to be studied, as well as interesting new directions such as including 
generalization/scale in indexing or treating multiway instead of binary joins. 
Some topics such as generalization, spatial data mining, or treatment of uncer- 
tainty have been around for a while and continue to be in the focus of interest. 
A strong newcomer in this volume is the topic of spatio-temporal databases, 
especially with a “moving objects” flavor. One reason for increased interest in 
this area is the existence of the related European Research Networks; this was 
also addressed in the keynote speech by Timos Sellis. 

Furthermore, for the first time SSD offered a track with presentations of 
“industrial and visionary applications” papers. We hope that this will further 
promote the exchange of ideas between practitioners and researchers, especially 
by pointing out research problems arising in practice, and giving researchers 
feedback on how suitable their solutions are within a system context. These 
contributions were selected by a small industrial subcommittee of the program 
committee. 

Numerous people helped to make this conference a success. First, the mem- 
bers of the program committee as well as the external referees did an excellent 
job in providing careful reviews, and in conducting engaged conflict resolution 
discussions via email before the program committee meeting. The program com- 
mittee members who attended the meeting in Hagen deserve special credit. We 
also thank Agnes Voisard for her help in promoting this event, and for a lot of 
good advice based on her experience with the last conference. Thanks to Dave 
Abel for organizing a careful selection process for the industrial contributions. 
Topological Queries



Christos H. Papadimitriou* 

University of California, Berkeley, USA 
christos@cs.berkeley.edu



A spatial database contains information about several two-dimensional regions. 
Of all queries one may ask about such a database, the topological queries are 
those that are invariant under continuous invertible deformations and reflections 
of the database. For example, of these three queries, ‘region A has larger area
than region B,’ ‘region A is above region B,’ and ‘region A and region B have
a connected intersection,’ only the third one is topological. Topological queries
form a robust and practically interesting subclass of all possible queries one may 
want to ask of a spatial database. Furthermore, they give rise to an unexpectedly 
complex and rich theory. 
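
The distinction between the three example queries can be illustrated with a small sketch. It is an illustration only, under simplifying assumptions: regions are reduced to axis-aligned rectangles, all names (Rect, deform, larger_area, above, overlaps) are hypothetical, and a plain overlap test stands in for the connected-intersection query as a simple topological property. The map (x, y) -> (x^3, -y) is continuous and invertible and includes a reflection.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle [x1, x2] x [y1, y2]; a stand-in for a general region."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)


def deform(r: Rect, fx: Callable[[float], float], fy: Callable[[float], float]) -> Rect:
    """Image of r under (x, y) -> (fx(x), fy(y)).

    fx and fy are assumed continuous and strictly monotone, so the map is
    invertible and sends rectangles to rectangles.
    """
    xa, xb = sorted((fx(r.x1), fx(r.x2)))
    ya, yb = sorted((fy(r.y1), fy(r.y2)))
    return Rect(xa, ya, xb, yb)


def larger_area(a: Rect, b: Rect) -> bool:
    """Metric query: depends on sizes, not preserved by deformation."""
    return a.area() > b.area()


def above(a: Rect, b: Rect) -> bool:
    """Directional query: not preserved by reflection."""
    return a.y1 >= b.y2


def overlaps(a: Rect, b: Rect) -> bool:
    """Topological query: do the interiors share a point?"""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def cube(x: float) -> float:
    return x ** 3          # continuous and strictly increasing


def flip(y: float) -> float:
    return -y              # reflection across the x-axis


A = Rect(0, 2, 2, 3)
B = Rect(3, 0, 4, 1.5)
C = Rect(1, 2.5, 5, 4)
dA, dB, dC = (deform(r, cube, flip) for r in (A, B, C))

# The first two answers change under the deformation; the overlap answers do not.
print(larger_area(A, B), larger_area(dA, dB))
print(above(A, B), above(dA, dB))
print(overlaps(A, C), overlaps(dA, dC))
print(overlaps(A, B), overlaps(dA, dB))
```

A single deformation that flips an answer is enough to show that a query is not topological; showing that a query is topological requires an argument over all such deformations, which a test like this cannot replace.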

The most elementary query language imaginable, the Boolean closure of the
eight basic topological relations [2] between two regions, such as disjoint(A, B)
and includes(B, C), is already surprisingly complicated (it is open whether its
satisfiability problem is decidable), and has given rise to some novel problems
in the theory of planar graphs. Allowing quantifiers that range over all
regions yields a much more powerful language, which ultimately can express all
topological queries, but is of formidable complexity [1] (this was to be expected,
as any language expressing all topological queries must necessarily be very
complex). However, it turns out that all topological queries can be expressed as
relational queries on the topological invariant of the spatial database, a graph
structure that captures the planar embedding of the planar graph defined by the
boundaries of the regions; there is much known about the expressiveness and
complexity of queries on the topological invariant. The class of topological
queries expressible in another possible query language, the first-order theory of
reals (where the quantifiers range over real coordinates of the plane), has been
delimited severely.
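
As a minimal sketch of the elementary language, the eight relations can be computed directly when regions are restricted to axis-aligned rectangles. This restriction, the function name relation, and the relation names used here (the commonly used disjoint, meet, overlap, equal, inside, contains, covers, coveredBy, with contains assumed to play the role of includes above) are assumptions of the example, not definitions taken from the text.

```python
from typing import Tuple

# (x1, y1, x2, y2) with x1 < x2 and y1 < y2; rectangles are closed sets.
Rect = Tuple[float, float, float, float]


def relation(a: Rect, b: Rect) -> str:
    """Classify the topological relation between two axis-aligned rectangles."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if a == b:
        return "equal"
    # The closed rectangles share no point at all.
    if ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1:
        return "disjoint"
    # They share boundary points only: some pair of facing sides coincides.
    if ax2 == bx1 or bx2 == ax1 or ay2 == by1 or by2 == ay1:
        return "meet"
    a_in_b = bx1 <= ax1 and ax2 <= bx2 and by1 <= ay1 and ay2 <= by2
    b_in_a = ax1 <= bx1 and bx2 <= ax2 and ay1 <= by1 and by2 <= ay2
    if a_in_b:
        # Strict on all four sides means the boundaries do not touch.
        strict = bx1 < ax1 and ax2 < bx2 and by1 < ay1 and ay2 < by2
        return "inside" if strict else "coveredBy"
    if b_in_a:
        strict = ax1 < bx1 and bx2 < ax2 and ay1 < by1 and by2 < ay2
        return "contains" if strict else "covers"
    return "overlap"


# A query in the Boolean closure: "A and B are disjoint, or B contains C".
A, B, C = (0, 0, 1, 1), (2, 2, 6, 6), (3, 3, 4, 4)
answer = relation(A, B) == "disjoint" or relation(B, C) == "contains"
```

Evaluating such a query on given regions is easy; the open question mentioned above concerns satisfiability, that is, whether regions exist at all that make a given Boolean combination true.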

The major open question in the area of topological queries is the design of a 
natural and efficiently interpretable query language that captures all efficiently 
computable topological queries of a spatial database — that is to say, of a natural 
and expressive topological query language with reasonable implementation cost. 
Research Issues in Spatio-temporal Database Systems

Timos Sellis
Abstract. Spatio-temporal database management systems can become an enabling
technology for important applications such as Geographic Information
Systems (GIS), environmental information systems, and multimedia. In this pa- 
per we address research issues in spatio-temporal databases, by providing an 
analysis of the challenges set, the problems encountered, as well as the pro- 
posed solutions and the envisioned research areas open to investigation. 



1 Introduction 

Temporal databases and spatial databases have long been separate, important areas of 
database research, and researchers in both areas have felt that there are important 
connections in the problems addressed by each area, and the techniques and tools
utilized for their solution. So far, relatively little systematic interaction and synergy
between these two areas has occurred. Current research aims to achieve exactly this
kind of interaction and synergy, and aims also to address the many real-life problems 
that require spatio-temporal concepts that go beyond traditional research in spatial 
and temporal databases. Spatio-temporal database management systems (STDBMSs) 
can become an enabling technology for important applications such as Geographic 
Information Systems (GIS), environmental information systems, and multimedia. In 
this paper we address research issues in spatio-temporal databases, by providing an 
analysis of the challenges set, the problems encountered, as well as the proposed 
solutions and the envisioned research areas open to investigation. 

Most of the ideas presented in this paper are based on the experience with a re- 
search project that has been going on since 1996, ChoroCHRONOS. ChoroCHRONOS 
was established as a Research Network with the objective of studying the design, 
implementation, and application of STDBMSs. The participants of the network are 
the Institute of Computer and Communication Systems of the National Technical 
University of Athens, Aalborg University, FernUniversität Hagen, Università degli
Studi di L'Aquila, UMIST, Politecnico di Milano, INRIA, Aristotle University of 
Thessaloniki, Agricultural University of Athens, Technical University of Vienna, and 
ETH. All these are established research groups in spatial and temporal database sys- 
tems, most of which have so far been working exclusively on spatial or temporal 
databases. ChoroCHRONOS enables them to collaborate closely and to integrate their
findings in their respective areas. 

To achieve the objective of designing spatio-temporal databases, several issues 
need to be addressed; these are related to (a) the ontology, structure, and representa- 
tion of space and time, (b) the data models and query languages, (c) graphical user 
interfaces, (d) query processing algorithms, storage structures and indexing tech- 
niques, and (e) architectures for STDBMSs. 



2 Overview of Research Issues 

Put briefly, a spatio-temporal database is a database that embodies spatial, temporal, 
and spatio-temporal database concepts, and captures simultaneously spatial and tem- 
poral aspects of data. 

All the individual spatial and temporal concepts (e.g., rectangle or time interval) 
must be considered. However, attention focuses on the area of the intersection be- 
tween the two classes of concepts, which is challenging, as it represents inherently 
spatio-temporal concepts (e.g., velocity and acceleration). In spatio-temporal data 
management, the simple aggregation of space and time is inadequate. Simply con- 
necting a spatial data model to a temporal data model will result in a temporal data 
model that may capture spatial data, or in a spatial data model that may capture time- 
referenced sequences of spatial data. 

Rather, the temporal characteristics of spatial objects (i.e., how entities evolve in 
space) must be investigated in order to produce inherently spatio-temporal concepts 
such as unified spatio-temporal data structures, spatio-temporal operators (e.g., ap- 
proach, shrink), and spatio-temporal user-interfaces. 
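
A minimal sketch may make the notion of an inherently spatio-temporal concept more concrete. It assumes a moving point whose position is a piecewise-linear function of time; the class MovingPoint, the helper distance, and the sampled approaches test are hypothetical names chosen for this illustration and are not operators defined in the text.

```python
from bisect import bisect_right
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple


@dataclass
class MovingPoint:
    """A point whose position is a piecewise-linear function of time.

    samples holds at least two (t, x, y) triples with strictly increasing t.
    """
    samples: List[Tuple[float, float, float]]

    def _segment(self, t: float):
        """Return the two samples that bracket time t."""
        times = [s[0] for s in self.samples]
        if not times[0] <= t <= times[-1]:
            raise ValueError("t lies outside the recorded lifespan")
        i = min(bisect_right(times, t), len(times) - 1)
        return self.samples[i - 1], self.samples[i]

    def at(self, t: float) -> Tuple[float, float]:
        """Position at time t, by linear interpolation."""
        (t0, x0, y0), (t1, x1, y1) = self._segment(t)
        f = (t - t0) / (t1 - t0)
        return x0 + f * (x1 - x0), y0 + f * (y1 - y0)

    def velocity(self, t: float) -> Tuple[float, float]:
        """Velocity vector on the segment containing t."""
        (t0, x0, y0), (t1, x1, y1) = self._segment(t)
        return (x1 - x0) / (t1 - t0), (y1 - y0) / (t1 - t0)


def distance(p: MovingPoint, q: MovingPoint, t: float) -> float:
    (px, py), (qx, qy) = p.at(t), q.at(t)
    return hypot(px - qx, py - qy)


def approaches(p: MovingPoint, q: MovingPoint, t_start: float, t_end: float,
               steps: int = 20) -> bool:
    """Approximate test: the distance shrinks at every sampled instant."""
    ts = [min(t_end, t_start + (t_end - t_start) * k / steps)
          for k in range(steps + 1)]
    ds = [distance(p, q, t) for t in ts]
    return all(later < earlier for earlier, later in zip(ds, ds[1:]))


car = MovingPoint([(0, 0, 0), (10, 10, 0)])
bus = MovingPoint([(0, 20, 0), (10, 12, 0)])
away = MovingPoint([(0, 20, 0), (10, 40, 0)])
```

Neither velocity nor approaches can be answered from a single time-stamped snapshot of spatial data; both need the position as a function of time, which is the point made above about simple aggregation of space and time being inadequate.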

The main topics of interest when studying the issues involved in spatio-temporal 
databases are: 

• Ontology, Structure, and Representation of Space and Time. This involves the 
study of temporal and spatial ontologies, including their interrelations and their 
utility in STDBMSs. In addition, structural and representational issues as they 
have been articulated in spatial and temporal database research should be consid- 
ered in order to obtain a common framework for spatio-temporal analysis. 

• Models and Languages for STDBMSs. The focus here is on three topics: (i) the 
study of languages for spatio-temporal relations, (ii) the development of models 
and query languages for spatio-temporal databases, and (iii) the provision of design
techniques for spatio-temporal databases. This work builds on previous proposals
and covers relational and object-oriented databases.

• Graphical User Interfaces for Spatio-temporal Information. Research in this area 
has two goals: (i) the extension of graphical interfaces for temporal and spatial da- 
tabases, and (ii) the development of better visual interfaces for specific applica- 
tions (e.g. VRML for time-evolving spaces). 

• Query Processing in Spatio-temporal Databases. Techniques for the efficient 
evaluation of queries are the focus of this area. These studies cover a variety of 
optimization techniques, ranging from algebraic transformations to efficient
page/object management. 

• Storage Structures and Indexing Techniques for Spatio-temporal Databases. Re- 
search in this area involves the integration or mixing of previously proposed stor- 
age and access structures for spatial and/or temporal data. 

• The Architecture of an STDBMS. Finally, care must be taken in developing real 
systems, and therefore the architecture of an STDBMS is of high interest.

After this brief outline of the research areas, we proceed to give more detailed de- 
scriptions of some of these areas. 



3 Spatio-temporal Data Ontologies and Modeling 

In this section we address some issues involved in the ontology of spatial entities as 
well as the ontology of space itself, and issues corresponding to the development of 
conceptual and logical models, along with respective languages, for spatio-temporal 
data. 



3.1 Ontological Issues 

Regarding the ontology of spatial entities, in order to model change in geographic 
space, a distinction is made between life (the appearance and disappearance, and 
merging and splitting of objects) and motion (the change of location over time) of 
objects. At its outset, this research must identify and investigate prototypical situa- 
tions in the life and movement of objects in geographic space. 

Regarding the ontology of space, one should observe that spatial objects are lo- 
cated at regions in space. The concept of exact location is a relation between an ob- 
ject and the region of space it occupies. Spatial objects and spatial regions have a 
composite structure, i.e., are made up of parts. The ways in which parts of objects are 
located at parts of regions of space are captured by the notion of part location. Since 
there are multiple ways for parts of spatial objects to be located at parts of regions of 
space, multiple part location relations are identified, and a classification of part loca- 
tion relations is needed. An example of such work is the work on rough locations 
[2]. Rough locations are characterized by sets of part location relations that relate 
parts of objects to parts of partition regions [1]. 



3.2 Models and Languages 

The development of models and languages for spatio-temporal database systems is a
central activity, as it serves as the basis for several other tasks (for example, query
processing and optimization). This research may be divided into two categories:
a) research that focuses on
tightly integrated spatio-temporal support, and b) previously initiated efforts that have 
dealt mainly with temporal aspects, extended to also cover spatial aspects. We con- 
sider in turn research in each category. 
An important effort focuses on identifying the main requirements for a spatio-
temporal data model and a spatio-temporal DBMS. Based on a data type approach to 
data modeling, the concept of moving objects has been studied in [4, 5]. This has led
to a series of results ranging from data model specifications (at two different
abstract levels) to implementation aspects. Having specified mappings between
different data type models and their relational DBMS embeddings, a precise landscape
of the models’ relative expressive powers has been drawn in [6]. Finally, concrete 
spatio-temporal data models for moving objects have also been provided. Here the 
focus has been on a systematic classification of operations on relevant data types that 
facilitates a highly generic specification and explanation of all types and, in particu- 
lar, of the operations on them. A detailed description of this model, including formal 
specifications of the types and operations is presented in [10] along with examples 
demonstrating the data model in action. 

A significant effort in the area of models and languages also deals with constraint 
database models. These models constitute a separate direction that researchers are 
exploring in modeling spatio-temporal information. As an example, DEDALE is a 
prototype of a constraint database system for spatio-temporal information [7, 8].
DEDALE is implemented on top of the O2 DBMS and features graphical querying. It
offers a linear-constraint abstraction of geometric data, allowing the development of 
high-level, extensible query languages with a potential for optimization, while al- 
lowing the use of optimal computation techniques for spatial queries. Also in the 
context of constraint database models, other researchers have studied the role of spa- 
tial and temporal constraints in STDBMS [14]. The efforts here concentrated on the 
development of a new spatio-temporal constraint-based database model. This model 
is based on the linear constraint database model of DEDALE and the indefinite tem- 
poral constraint database model (ITCDB) previously proposed by Koubarakis. 
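
The idea of a linear-constraint abstraction can be sketched as follows. This is an illustration under stated assumptions, not DEDALE's data format or API: a convex spatio-temporal object is written as a conjunction of linear inequalities over (x, y, t), and the names HalfSpace and satisfies are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HalfSpace:
    """One linear constraint a*x + b*y + c*t <= d over space and time."""
    a: float
    b: float
    c: float
    d: float

    def holds(self, x: float, y: float, t: float) -> bool:
        return self.a * x + self.b * y + self.c * t <= self.d


def satisfies(conjunction: List[HalfSpace], x: float, y: float, t: float) -> bool:
    """A point belongs to the object if it satisfies every constraint."""
    return all(h.holds(x, y, t) for h in conjunction)


# A unit square that moves right at speed 1 during 0 <= t <= 5:
#   t <= x <= t + 1,  0 <= y <= 1,  0 <= t <= 5
moving_square = [
    HalfSpace(-1, 0, 1, 0),    # t - x <= 0
    HalfSpace(1, 0, -1, 1),    # x - t <= 1
    HalfSpace(0, -1, 0, 0),    # -y <= 0
    HalfSpace(0, 1, 0, 1),     # y <= 1
    HalfSpace(0, 0, -1, 0),    # -t <= 0
    HalfSpace(0, 0, 1, 5),     # t <= 5
]

inside_at_start = satisfies(moving_square, 0.5, 0.5, 0)
left_behind = satisfies(moving_square, 0.5, 0.5, 2)
followed = satisfies(moving_square, 2.5, 0.5, 2)
```

In this representation time is simply one more variable in the constraints, so the moving square is stored as six inequalities rather than as a sequence of snapshots.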

Other research teams pursue research in extending relational models and lan- 
guages. For example, the core of an SQL-based language, STSQL, has been proposed 
[3]. This language generalizes previous proposals by permitting relations to include 
multiple temporal as well as spatial attributes, and it generalizes temporal query lan- 
guage constructs, to apply to both the spatial and temporal attributes of relations. 
Because space and time are captured by separate attributes, STSQL is intended for 
applications that do not involve storing the movement of continuously moving ob- 
jects. 

Spatial and temporal conceptual modeling extends previous work on temporal and 
spatial data modeling. Spatial modeling aspects, e.g., the representation of objects’ 
“position” in space, as well as temporal modeling aspects, e.g., the capture of the 
valid time of objects’ properties, have been studied, and resulting new modeling con- 
structs have been applied to existing conceptual models such as the ER model. Fur- 
thermore, the structure and behavior of so-called spatio-temporal phenomena (e.g., a 
“storm”) have been investigated, and a formal framework with a small set of new 
modeling constructs for capturing these during conceptual design, has been defined 
[19, 20, 21, 22].

Finally, modeling issues related to uncertain spatio-temporal data need to be ex- 
amined. By adopting fuzzy set methodologies, for example, a general spatial data 
model can be extended to incorporate the temporal dimension of geographic entities
and their uncertainty. In addition, the basic data interpretation operations for handling
the spatial dimension of geographic data have been extended to also support spatio- 
temporal reasoning and fuzzy reasoning. Some work has already been initiated in this 
direction [13]. 



4 Storage Structures, Indexing Techniques, and Query Processing 

Having given a brief overview of the data modeling efforts undertaken, this section 
concentrates on efforts to develop techniques for the efficient implementation of the 
proposed data models and languages. 

Substantial efforts have been devoted to the study of storage structures and in- 
dexing. In particular, (a) efficient extensions of spatial storage structures to support 
motion have been proposed, and (b) benchmarking issues have been studied. 

Modern DBMSs should be able to efficiently support the retrieval of data based on 
the spatio-temporal extents of the data. To achieve this, existing multidimensional 
access methods need to be extended. Work has already been initiated in this area. For
example, approaches that extend R-trees and quadtrees were reported in [18] and 
[23], respectively, along with extensive experiments on a variety of synthetic data 
sets. 
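
One simple way to extend a multidimensional access method to spatio-temporal extents is to treat time as an additional dimension. The sketch below shows only the bounding-box test such an approach relies on; it does not reproduce the structures of [18] or [23], and the names Box3D and window_query are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box3D:
    """Closed bounding box over two spatial dimensions and time."""
    x1: float
    y1: float
    t1: float
    x2: float
    y2: float
    t2: float

    def intersects(self, other: "Box3D") -> bool:
        # Boxes intersect exactly when their extents overlap in every dimension.
        return (self.x1 <= other.x2 and other.x1 <= self.x2 and
                self.y1 <= other.y2 and other.y1 <= self.y2 and
                self.t1 <= other.t2 and other.t1 <= self.t2)


def window_query(entries: List[Tuple[str, Box3D]], window: Box3D) -> List[str]:
    """Return the ids of all entries whose boxes intersect the query window.

    A linear scan stands in for index traversal here; an R-tree applies the
    same intersection test to the bounding boxes of its nodes to prune subtrees.
    """
    return [oid for oid, box in entries if box.intersects(window)]


entries = [
    ("truck", Box3D(0, 0, 0, 5, 5, 10)),
    ("ship", Box3D(20, 20, 0, 30, 30, 10)),
    ("plane", Box3D(2, 2, 50, 4, 4, 60)),
]
window = Box3D(1, 1, 5, 3, 3, 8)   # region [1,3] x [1,3] during time [5,8]
result = window_query(entries, window)
```

The plane entry shows why the temporal extent matters: it lies inside the spatial window but is rejected because its time interval does not overlap the query interval.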

Work on benchmarking issues for spatio-temporal data has also started and is re- 
ported in [16]. This work introduced basic specifications that a spatio-temporal index 
structure should satisfy, evaluated existing proposals with respect to the specifica- 
tions, and illustrated issues of interest involving object representation, query proc- 
essing, and index maintenance. As a second step, a benchmarking environment that 
integrates access methods, data generation, query processing, and result analysis 
should be developed. The objective is to obtain a common platform for evaluating 
spatio-temporal data structures and operations that are connected to a data repository 
and a synthetic data set generator. A platform for evaluating spatio-temporal query
processing strategies has been designed and implemented and has already been used
for evaluating spatial join strategies [11]. The “A La Carte” environment also pro- 
vides benchmarking features for spatial join operations [9]. Finally, a very important 
step in this direction has been the work on generating spatio-temporal data in a con- 
trolled way so that benchmarks can be run [17]. 

Work on query processing and optimization has focused thus far on (a) the devel- 
opment of efficient strategies for processing spatial, temporal, and inherently spatio- 
temporal operations, (b) the development of efficient cost models for query optimiza- 
tion purposes, and (c) the study of temporal and spatial constraint databases. 

In [15] it was argued that expressing spatial operations, required by different appli- 
cation domains, is possible through a set of window searches, so that their execution 
could be supported by the available spatial indexing techniques. When the availability 
of index structures is not guaranteed, incremental algorithms have been proposed to 
support join operations for time-oriented data [12]. Regarding the execution of inherently
spatio-temporal operations, the basic classes of spatio-temporal operations required
by different application domains involving the representation of and reasoning about
a dynamic world should be defined [24].



5 Conclusions 

In this paper we presented some issues related to the research undertaken in spatio-
temporal databases. Clearly, significant progress has been achieved in several areas; 
these include the understanding of the requirements of spatio-temporal applications, 
data models, indexing structures, and query evaluation. 

Although the research community has made significant progress, much work re- 
mains to be done before an STDBMS may become a reality. Open areas include 

• devising data models and operators with clean and complete semantics, 

• efficient implementations of these models and operators, 

• work on indexing and query optimization, 

• experimentation with alternative architectures for building STDBMSs (e.g. lay- 
ered, extensible, etc). 

One can observe that this is an exciting new area and the spatial database community
will have a say in it!
Since the 1960s, researchers and practitioners in the geosciences have been working
on computer solutions to match their specific needs. Commercial geographic
information systems (GIS) are among the most important outcomes of these efforts.
In those early years, computer scientists were involved only marginally
in the development of such systems. More generally speaking, cooperation between
computer scientists and the geoscientific communities has traditionally
been rare.

In the 1980s, however, this started to change, and the increasing number
of contacts is now bearing fruit. The design and implementation of data management
tools for spatial applications are pursued by an interdisciplinary community
of people from academia, government, and industry. Spatial databases are now
considered an important enabling technology for a variety of application software
(such as CAD systems or GIS), and they are starting to find their way into mainstream
business software (such as ERP systems). Many commercial database management
systems have special toolboxes for the management of spatial data. There
have been several interdisciplinary research projects of high visibility, including
the U.S. National Center for Geographic Information and Analysis (NCGIA)
and the Sequoia 2000 project. U.C. Berkeley’s Digital Environmental Library
(ELIB) project and U.C. Santa Barbara’s Alexandria project [2] are
pursuing related goals; both projects are funded through the NSF/ARPA/NASA
Digital Library Initiative.

It was also in the 1980s that a number of interdisciplinary conference series 
were launched. In 1984, GIS researchers initiated the first Symposium on Spa- 
tial Data Handling (SDH). Five years later, an NCGIA research initiative led to 
the first Symposium on Spatial Databases (SSD). Compared to SDH, SSD has 
a stronger focus on computer technology. 1993 was the year of the first Confer- 
ence on Spatial Information Theory (COSIT) and the first ACM Workshop on 
Geographic Information Systems (ACM-GIS). All of these conference series continue
to be held annually (ACM-GIS) or biennially (COSIT, SDH, SSD). Their
proceedings often appear as books with major publishers (e.g., those of SSD).

These and related activities have resulted in numerous interdisciplinary pub- 
lications, as well as system solutions and commercial products. They also led to 
a certain refocus of one of the premier journals in the area, the International 
Journal on Geographic Information Science (formerly the International Journal 
on Geographic Information Systems), published by Taylor & Francis. In 1996, 
Kluwer Academic Publishers started another journal in this area, called
GeoInformatica.
At the organizational level, an important event happened in 1994, when an
international group of GIS users and vendors founded the Open GIS Consortium
(OGC). OGC has quickly become a powerful interest group to promote
open systems approaches to geoprocessing. It defines itself as a “membership
organization dedicated to open systems approaches to geoprocessing.” It promotes
an Open Geodata Interoperability Specification (OGIS), which is a computing
framework and software specification to support interoperability in the
distributed management of geographic data. OGC seeks to make geographic data
and geoprocessing an integral part of enterprise information systems. More
information about OGC, including all of its technical documents, is available at
the consortium’s Web site, http://www.opengis.org.

This short historical overview shows that spatial data management has come 
a long way, both as an academic discipline and as an application of advanced 
computing technology. The SSD conference series had an important role in these 
developments. Since SSD was first held 10 years ago in Santa Barbara, California,
the conference series has established itself as an important cornerstone of the 
community. It continues to be an important opportunity to meet one’s peers and 
to “talk shop,” i.e. to discuss highly specific technical issues. Interdisciplinary 
communication between geoscientists and computer scientists is the norm, as is 
technology transfer between researchers and practitioners. In all these respects, 
SSD is a good example of the smaller, more focused conferences that complement 
the larger venues, such as SIGMOD or VLDB. 

If one compares the programs of SSD 1989 and SSD 1999, one observes that, 
on the one hand, the 1999 program reflects some of the major paradigm changes 
in data management. There are several papers on Internet-related topics and on 
data mining - two topics that arguably had the most fundamental impact on 
data management during the past 10 years. 

On the other hand, many of the principal topics do not seem to have changed 
in a major way. Most papers presented at SSD 1999 can still be grouped into 
the following five major categories that were already present in 1989: 

— physical data management: access methods, query optimization; 

— logical data management: topology, modeling, algebras; 

— distributed data management: Internet-related aspects, fault tolerance; 

— spatial reasoning and cognition; 

— GIS applications. 

The work performed under these headings is mostly a natural continuation
of the work conducted 10 years ago. Some of these categories are represented
more strongly than others, partly to complement other conference series on spatial
data management. COSIT, for example, is traditionally strong on spatial reasoning
and cognition issues. SDH serves as an important outlet for application-oriented
work by the geoscientific community. Both of these areas take a somewhat
weaker role at SSD. Other categories are conspicuously missing at all three
conference series: papers on human-computer interfaces, for example, are rare.

I see two major reasons for this relative constancy. First, the essential techni- 
cal requirements regarding spatial databases and geographic information systems 
have not changed in the past ten years, except for certain technical issues raised
by the increased importance of the Internet. In particular, the much-heralded
switch to main memory data management has not happened (yet). Second, researchers
do not always just do what the marketplace demands. Many of us
choose our focus based on the intellectual stimulus provided by certain classes
of problems, as well as their apparent solvability in principle. This leads to a
longer-term orientation of one’s research focus. Let me give two examples.

The definition of a consistent and complete spatial algebra, for example, 
may seem of secondary relevance to most commercial GIS vendors and users. 
On the other hand, work on algebras can be great fun for researchers who are 
mathematically inclined and who like the idea of bringing structure to what 
used to be an unorganized collection of data types and operators. Even though 
a number of researchers have worked on related problems, the issue has not been
completely solved, and I certainly expect related papers on the program of SSD 
2009. This includes extensions to the classical approach, such as spatio-temporal 
algebras. 

The field of spatial access methods, my own area, has matured greatly during the past
10 years. Great papers were written - some of them even received awards for 
their long-term impact. Of course, there were also many papers that did not 
really help to move the field forward: yet another structure was proposed, yet 
another set of very narrow performance evaluations was conducted, and the
reader was left puzzled as to how and why the structure would be better than any
of the 100 methods already known. Technology transfer into modern GIS has 
somewhat slowed down, after some initial successes. Nevertheless, the field still 
attracts researchers young and old. Why? Because the related problems are well- 
defined, solving them is intellectually stimulating, and even if one does not have 
an earthshattering new idea, the chances of obtaining a publishable paper after 
a few months of relatively straightforward work are not bad. In addition, there 
are many areas of possible extensions of high practical relevance, including high- 
dimensional applications (such as data mining) or spatio-temporal modeling. 
Once again, I am certain to find related papers on the agenda of SSD 2009. 

Obviously there is a certain discrepancy between short- and mid-term practical
needs on the one hand and researchers’ interests on the other hand. Personally
speaking, I do not find this problematic, even though we are working in an
application-oriented field. On the contrary, I find this discrepancy essential for 
the research paradigm we have adopted. Like most of my colleagues in research, 
I strongly believe that a research community can deliver “useful” results only if 
researchers have the right and the means to play, i.e., to work on issues that they 
just enjoy working on. It is our duty as a research community, however, to
help policy makers define how many people should have this privilege,
and to identify the right people among our graduate students and junior
colleagues to enjoy it. Having too many journals and conferences with too many 
papers that nobody is really interested in can be as detrimental to a research 
community as having not enough outlets. 
With regard to SSD, this means that we should continue to be selective, and
that we should continue to honor work that balances intellectual stimulus with 
practical relevance. More concretely: we have to continue to honor work that 
solves difficult intellectual problems without regarding their immediate practical 
relevance. On the other hand, we also have to continue to honor good practical 
solutions that reflect the technical state-of-the-art, even though the underlying 
concepts may seem somewhat obvious to an academic researcher. The challenge 
for current and future SSD program committees has always been to balance 
these two approaches to research. So far I believe we have succeeded in finding 
the right balance, and I am looking forward to a continuation of this tradition. 
SSD: ad multos annos!



Abstract. This work is a contribution to the developing literature on
multi-resolution data models. It considers operations for model-oriented
generalization in the case where the underlying data is structured as a
graph. The paper presents a new approach in that a distinction is made
between generalizations that amalgamate data objects and those that
select data objects. We show that these two types of generalization are
conceptually distinct, and provide a formal framework in which both can
be understood. Generalizations that are combinations of amalgamation
and selection are termed simplifications, and the paper provides a formal
framework in which simplifications can be computed (for example, as
compositions of other simplifications). A detailed case study is presented
to illustrate the techniques developed, and directions for further work
are discussed.



1 Introduction 

Specialist spatial information systems (SIS) play an increasingly important role
within the Information Technology industry. For the potential of SIS to
be fully realised, spatial database functionality needs to be integrated with other
more generic aspects of database technology. Spatial data comprise a valuable 
subset of the totality of data holdings of an enterprise and their utility is opti- 
mized when they are flexible enough to be capable of integration with other data 
sets in a variety of ways. The focus of this paper is a contribution towards the 
provision of flexibility with regard to the scale or resolution at which data are 
handled. Resolution is concerned with the level of discernibility between elements 
of a phenomenon that is being represented by the data, and higher resolutions 
allow more detail to be observed in the components of the phenomenon. Flex- 
ibility in handling resolution is advanced by provision of multi-resolution data 
models, where data are managed in the SIS at a variety of levels of detail. 

The issue of multi-resolution spatial datasets has been taken up by several
authors. In our own earlier work [SW98, Wor98a, Wor98b]
we proposed a general model that helps to provide a formal basis for processing
and reasoning with spatial data that are heterogeneous with regard to semantic
and geometric precision. For multi-resolution data models to be effective, there
must be the means to make appropriate transitions between levels of detail
in the data. Transition from higher to lower resolutions is often referred to as
generalization. Cartographic generalization has been the subject of a great deal of
research by the cartographic and GIS communities, particularly on the geometric
aspects of the generalization process. When
the word ‘generalization’ is used in this paper it usually refers to model-oriented
generalization, in the sense of [M+95], as we are not concerned here with the
particular form of the cartographic representation of the data.

The focus of this paper is on the geometric components of geospatial data. 
In particular, the emphasis here is on network data structures, as they provide 
a simpler case than fully two-dimensional data structures, and yet have many 
applications to real world systems. The paper seeks to make a clear formal 
distinction between model-oriented generalizations that are based on selection 
of data and those that are based on data amalgamation. This is a distinction 
that is somewhat blurred in some of the earlier multi-resolution spatial data 
models. 

In the next section, the background to this research is outlined, particularly 
in the context of multi-resolution data models, generalization and functional- 
ity in databases for handling graphs. The following section makes precise the 
distinction between selection and amalgamation transformations from higher to 
lower levels of detail. The remainder of the paper is devoted to working out in 
detail the formal properties of selection and amalgamation operations on graphs, 
and includes consideration of a detailed case study. 



2 Background 

2.1 Multi-resolution Data Models and Generalization 

Generalization is the process of transforming a representation of a geographic 
space to a less detailed one. The representation may be in terms of a data/process 
model, in which case the transformation is called model generalization, or involve 
visualization of the space on an output device or hard copy, in which case the 
transformation is called cartographic generalization. Cartographic generalization 
has been the subject of a great deal of research by the cartographic and GIS
communities, particularly on the geometric aspects of the generalization process.
Model-oriented generalization was introduced by
Muller et al. [M+95]. Rigaux and Scholl discuss the impact of scale and
resolution on spatial data modelling and querying. They develop a theory with
spatial and semantic components and apply the ideas to a partial implementation
in the object-oriented DBMS O2.

A multi-resolution model of a geographic space affords representations at a 
variety of levels of detail as well as providing a structure in which these repre- 
sentations are located. In such models, generalization operators are required in 
order that transitions between different locations in the structure can be made. 
Puppo and Dettori [PD95] provide a formal model of some of the topological and
metric aspects of multi-resolution using abstract cell complexes and homotopy
theory, both topics within algebraic topology. These ideas are further developed
by Bertolotto [Ber98], who proposes a definition of a base set of transformations
from which a significant class of generalization operators can be obtained.
This body of work provides one of the motivations for the current work, in 
that the earlier work does not seek to make a distinction between selection of 
features, where certain features of the phenomena are chosen and others omit- 
ted, and amalgamation of features, where certain features originally considered 
distinguishable are made indistinguishable. In current multi-resolution models, 
selection and amalgamation are not distinguished, yet they are conceptually 
quite distinct. 

We use the term simplification for a generalization which can be described 
as a selection followed by an amalgamation. We are aware that ‘simplification’ 
does refer to a very specific operation in the literature on cartographic general- 
ization, and that we are using the word in a more general sense here. However, 
this word has been used by Puppo and Dettori [PD95] in a way that
fits very closely with the present paper. Puppo and Dettori define ‘simplification
mappings’, which are certain mappings between cell complexes. A particular
simplification mapping $F : \Gamma \to \Gamma'$ provides a reason why the cell complex $\Gamma'$
is a simplified version of $\Gamma$. This leads to a category [BW95], where the objects
are cell complexes and the morphisms are the simplification mappings. In our 
work we have a category where the objects are graphs, and the morphisms are 
simplifications in our sense. 

Further work on the amalgamation properties of multi-resolution data models
is discussed by Worboys [Wor98b], where a lattice of resolution is constructed
and properties of entities represented at differing degrees of granularity are
considered. This theme is pursued further in [Wor98a] by showing that the resolution
lattice can be applied to both geometric and semantic resolutions. Stell and
Worboys [SW98] develop the formal properties of the resolution lattice, showing how
each resolution in the lattice gives rise to a space of spatial data representations
all with respect to that resolution, and the totality of spatial data
representations being stratified by the resolution lattice. Generalization operators and
their inverses can be considered as transitions between layers in the stratified
spatial data space. Stell has also recently provided an analysis of different
notions of granularity for graphs.



2.2 Handling Graphs in Databases 

In this paper, the techniques developed with regard to selection and amalgama- 
tion operators are applied as transitions between resolutions in a multi-resolution 
data model in the particular context of graphs. This is done for three reasons: 

1. Graphs provide an intermediate level between non-spatial data and full pla- 
nar spatial data, and are sufficiently rich to illustrate the application of the 
approach. 

2. Graphs have many applications in spatial information systems, for example 
road and rail networks, and cable and other utility networks. 

3. The dual graph of an areal partition of the plane (where nodes of the dual
graph are associated with areas in the partition, and nodes are connected by 
an edge if and only if the areas are adjacent in the partition) is an important 
indicator of the topological relationships between the areas in the partition. 
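
The dual graph of item 3 can be sketched in a few lines. The following is only an illustration under stated assumptions: the partition is given as a small raster of area labels (a hypothetical encoding, not one used in this paper), two areas count as adjacent when cells carrying their labels share a side, and the function name `dual_graph` is our own.

```python
from itertools import product


def dual_graph(raster):
    """Return (nodes, edges) of the dual graph of a labelled raster.

    Nodes are the area labels; an edge {a, b} is present if and only if
    some cell labelled a shares a side with some cell labelled b.
    """
    rows, cols = len(raster), len(raster[0])
    nodes = {label for row in raster for label in row}
    edges = set()
    for r, c in product(range(rows), range(cols)):
        for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
            r2, c2 = r + dr, c + dc
            if r2 < rows and c2 < cols and raster[r][c] != raster[r2][c2]:
                edges.add(frozenset({raster[r][c], raster[r2][c2]}))
    return nodes, edges


# Hypothetical partition of a 3 x 3 raster into three areas A, B and C.
raster = ["AAB",
          "AAB",
          "CCC"]
nodes, edges = dual_graph(raster)
```

In this example every pair of areas is adjacent, so the dual graph is a triangle on the nodes A, B and C.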

The database community has put some effort into considering how generic database
technology can be used to provide functionality for handling graph data
structures. Mannino and Shapiro [MS90] survey work on extending the relational
database model to incorporate functionality for handling graphs, including
extensions to query languages for graph traversal. Güting presented an
approach that extended the relational data model with data types for
planar-embedded graphs. Güting’s 1992 paper was followed by a series of papers in
which he and colleagues developed the theme of incorporating graph handling
capabilities in database systems [Güt94]. Erwig and Schneider [ES97]
pose the question of the meaning of vagueness with reference to a graph. Stell and
Worboys [SW97] have discussed the algebraic structure of the set of subgraphs
of a graph.



3 Selection and Amalgamation 

A major motivation of this work is to clarify the distinction between selection 
and amalgamation generalization operations. In this section we explore the foun- 
dations of the concept “less detailed than”, based on the notions of selection and 
amalgamation. At the most abstract level, there are two ways in which data 
represented by a structure X can be less detailed than data represented by a 
structure Y. 

selection: X can be derived from Y by selecting certain features, and possibly 
leaving out others. 

amalgamation: X can be derived from Y by amalgamating some features of 
Y so that some distinct things in Y are regarded as indistinguishable and 
become just one thing in X. 



3.1 Amalgamation and Selection for Sets 

The two operations are illustrated in the case of sets X and Y, the simplest 
formal structures, by the following concrete examples. 



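[Figure: two examples for sets. Amalgamation: Y = {Heathrow 4, Heathrow 1-3,
Piccadilly Circus, Leicester Square} and X = {Heathrow, London}. Selection:
Y = {Moor Park, Northwood, Pinner, Harrow} and X = {Moor Park, Harrow}.]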

In the amalgamation example, the set Y consists of four stations on the London
Underground. For some application it may be inappropriate to distinguish be- 
tween the two individual stations at Heathrow Airport. Similarly, the stations 
Piccadilly Circus and Leicester Square are physically close together, and at a 
lower level of detail, the distinction between them may not be important. By 
avoiding the distinctions between these pairs of stations, we arrive at the set X 
as a less detailed representation of the data in Y. 

The example of selection is also derived from actual data about the London 
Underground. Here again Y is a set representing four individual stations which
is represented at a lower level of detail by a set X containing only two elements. 
However, in this case the operation performed on Y to produce X is quite dif- 
ferent. The stations present in X are selected from those in Y because of their 
relative importance. Northwood and Pinner are minor stations, and many trains 
which do stop at Moor Park and Harrow do not stop at the two smaller stations. 

When X and Y are sets it is straightforward to formalize the notions of
amalgamation and selection. If the relationship of X to Y is one of selection,
then there is an injective (or one-to-one) function from X to Y. If the relationship
is one of amalgamation, then there is a surjective (or onto) function from Y to
X.
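
As a small illustration of this formalization, the two station examples can be written as functions between finite sets and checked for injectivity and surjectivity. This is a sketch only: functions are encoded as Python dictionaries, and the helper names `is_injective` and `is_surjective` are our own.

```python
def is_injective(f):
    """A function given as a dict is injective if no value is repeated."""
    return len(set(f.values())) == len(f)


def is_surjective(f, codomain):
    """A function given as a dict is surjective if it hits every element."""
    return set(f.values()) == set(codomain)


# Selection: an injective function from X to Y (here simply the inclusion).
Y_sel = {"Moor Park", "Northwood", "Pinner", "Harrow"}
X_sel = {"Moor Park", "Harrow"}
selection = {x: x for x in X_sel}

# Amalgamation: a surjective function from Y to X.
Y_am = {"Heathrow 4", "Heathrow 1-3", "Piccadilly Circus", "Leicester Square"}
X_am = {"Heathrow", "London"}
amalgamation = {
    "Heathrow 4": "Heathrow",
    "Heathrow 1-3": "Heathrow",
    "Piccadilly Circus": "London",
    "Leicester Square": "London",
}
```

The amalgamation is surjective but not injective, which is exactly what it means for distinct stations to become indistinguishable.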

3.2 Combining Amalgamation and Selection for Sets 

The above examples deal with two ways in which X may be a less detailed
representation of Y. In more complicated examples the relationship need not be
solely one of amalgamation or selection. In general, a loss of detail relationship
between X and Y will involve both selection and amalgamation. This entails
a set Z which is obtained from Y by selection, and which is amalgamated to
produce X. A simple example appears in the following diagram.



[Figure: the set Y (Heathrow 4, Heathrow 1-3, Piccadilly Circus, Leicester
Square), a set Z selected from Y, and the set X obtained by amalgamating Z.]

A pair consisting of a selection followed by an amalgamation will be called a
simplification from Y to X. Formally, a simplification from a set Y to a set
X consists of three things: a set Z, an injective function from Z to Y (the
selection part) and a surjective function from Z to X (the amalgamation part).
Alternatively we can describe a simplification from Y to X as a partial surjective
function from Y to X.

It might appear that by defining a simplification to consist of a selection
followed by an amalgamation, we are being unnecessarily restrictive. It is
natural to ask whether this definition of simplification excludes an amalgamation
followed by a selection, or a sequence of the form $[s_1, a_1, s_2, a_2, \ldots, s_n, a_n]$ where
each $a_i$ is an amalgamation, and each $s_i$ is a selection. In fact, provided we are
dealing with simplifications of graphs, or of sets, every sequence of the above
form can be expressed as a single selection followed by a single amalgamation.
The justification for this lies in the fact that simplifications can be composed.
For simplifications of sets, this is discussed in the following paragraph. For
simplifications of graphs, composition is illustrated by an example in section 4 and
defined formally in section 5.4. It is worth noting that a single selection on its
own is still a simplification. This is because it can be expressed as a selection
followed by the trivial amalgamation in which no distinct entities are amalgamated.
Similarly, a single amalgamation on its own is a simplification, since it is equal
to the trivial selection, which selects everything, followed by the amalgamation.

A simplification from Y to X gives a way of modelling a reason why X is 
less detailed than Y. As the earlier examples showed, X can be a simplification 
of Y for many different reasons, thus it is necessary to keep track of the specific 
amalgamations and selections involved. It is also important to be able to compose 
simplifications. If $\sigma_1$ is a simplification from Y to X, and $\sigma_2$ a simplification from
X to W, we need to be able to construct a simplification $\sigma_1 ; \sigma_2$ from Y to W
which represents $\sigma_1$ followed by $\sigma_2$. The usual notion of composition for partial
functions provides the appropriate construction in the current context. 
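
The composition of simplifications of sets can be sketched directly from the description of a simplification as a partial surjective function. In the sketch below a partial function is a dictionary defined on a subset of its source set. The concrete mappings `sigma1` and `sigma2`, the set `W`, and the helper names are illustrative assumptions of ours, not data taken from the figures.

```python
def is_simplification(sigma, Y, X):
    """sigma is a partial function from Y to X that is onto X."""
    return set(sigma) <= set(Y) and set(sigma.values()) == set(X)


def compose(sigma1, sigma2):
    """sigma1 ; sigma2, defined on y when sigma1(y) lies in the domain of sigma2."""
    return {y: sigma2[x] for y, x in sigma1.items() if x in sigma2}


def decompose(sigma):
    """Split a simplification into its selection and amalgamation parts."""
    Z = set(sigma)                   # the selected elements
    selection = {z: z for z in Z}    # injective function from Z to Y
    amalgamation = dict(sigma)       # surjective function from Z to X
    return Z, selection, amalgamation


Y = {"Heathrow 4", "Heathrow 1-3", "Piccadilly Circus", "Leicester Square"}
X = {"Heathrow", "Piccadilly Circus"}
W = {"Heathrow"}  # hypothetical third level of detail

# Leicester Square is not selected; the two Heathrow stations are amalgamated.
sigma1 = {
    "Heathrow 4": "Heathrow",
    "Heathrow 1-3": "Heathrow",
    "Piccadilly Circus": "Piccadilly Circus",
}
# Piccadilly Circus is not selected.
sigma2 = {"Heathrow": "Heathrow"}

sigma12 = compose(sigma1, sigma2)
```

The composite is again a partial surjective function, so `decompose(sigma12)` recovers a single selection followed by a single amalgamation.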

3.3 Amalgamation and Selection for Graphs 

Simplifications between sets are a useful way of illustrating the concepts of selec- 
tion and amalgamation, but to handle more complex kinds of data we need more 
elaborate structures than sets. In this section simple examples of amalgamation 
and selection for graphs are introduced. Further examples appear in the detailed
case study in section 4.

The following two examples develop the previous treatment of amalgama- 
tion and selection for sets by adding edges between the elements of the sets to 
represent how the stations are joined by railway lines. 



[Figure: amalgamation and selection examples for graphs. The sets of the
earlier examples are shown with edges added for the railway lines joining
the stations.]

In the amalgamation example, the two stations at Heathrow airport collapse 
into a single entity, as before, but note that the edge between them in Y is 
not present in X . This disappearance of an edge can be understood in terms of 
amalgamations of paths. Roughly speaking, a path is a sequence of zero, one, or 
more edges in which each edge in the sequence shares one end with the next edge 
in the sequence and one end with the previous edge in the sequence. However, 
because we are using undirected edges, a more careful formal treatment is needed,
and appears in section 5.2 below. In the concrete example being discussed here,
four paths in Y become amalgamated into a single edge in X. The four paths
distinguished in Y which are amalgamated in X are as follows.

Heathrow 4 -- Heathrow 1-3 -- Piccadilly Circus -- Leicester Square

Heathrow 1-3 -- Piccadilly Circus -- Leicester Square

Heathrow 4 -- Heathrow 1-3 -- Piccadilly Circus

Heathrow 1-3 -- Piccadilly Circus

In the selection example, the edge in the graph X is not selected from the 
edges present in Y, but is selected from the paths in Y . The use of paths in 
both the amalgamation and selection operations is an important feature of our 
work. Formally we treat amalgamations and selections as particular kinds of 
morphisms between graphs. These morphisms are mappings taking nodes to 
nodes, but which may map edges to paths, and not merely to edges. 

A significant distinction between our work and that of both Puppo and Dettori
[PD95] and Bertolotto [Ber98] is illustrated in the selection example. As a
graph is a particular kind of 1-dimensional abstract cell complex, the technique 
of using continuous mappings between abstract cell complexes to model sim- 
plifications, which these authors use, can be applied to graphs. However, this 
technique would force us to use a mapping sending the two intermediate sta- 
tions as well as the three edges in the graph Y to the single edge in the graph 
X . Conceptually this act of amalgamating Northwood and Pinner stations with 
three railway lines appears inappropriate if we want to model the simple idea 
that our graph X is obtained from Y by omitting certain features altogether. 
While continuous mappings between abstract cell complexes may be suitable for 
some kinds of simplification, they do not seem adequate to model the concept 
of selection. 

These two examples of amalgamation and selection for graphs illustrate only 
a few of the features of our approach. As with sets, general loss of detail rela- 
tionships between graphs involve both amalgamation and selection. Examples 
showing how amalgamation and selection are combined into simplifications for 
graphs, and how simplifications of graphs are composed, appear in the detailed
case study in section 4 below.



4 Case Study 



In this section we present a detailed case study showing how our concepts of 
amalgamation and selection can be combined to yield a notion of simplification 
for graphs. The case study is drawn from genuine examples of the railway network 
in Britain. 

The following diagram illustrates a simplification: 



[Figure: a simplification from a graph with the stations Swansea, Swindon,
Oxford, Reading, Paddington, Euston, Bham N.St and Bham Intl to a graph with
the stations Birmingham, Reading, Euston and Paddington.]

At the most detailed level, two stations in Birmingham are shown: Birmingham
New Street and Birmingham International. These are amalgamated at the lower 
level. The line from Swansea to Birmingham New Street, as well as Swansea 
station itself are omitted at the lower level, as are the two stations intermediate 
between Reading and Birmingham. The route from Reading to Birmingham New 
Street is amalgamated with that from Reading to Birmingham International. 

The simplification is made up of a selection and an amalgamation as in the 
following diagram: 



[Figure: the simplification decomposed into the selection Select-1 followed
by the amalgamation Amalgamate-1.]



Note that a selection from a graph G need not be obtained by selecting some 
of the nodes and some of the edges from G. We allow a selection to take paths 
and not just edges from G. This technique allows selections to omit intermediate 
stations, such as Swindon, without being forced to omit railway lines passing 
through such stations. This means that we can model the fact that a line joins 
Reading to Birmingham New Street, even though no single edge represents this at 
the highest level. This use of paths is an important aspect of our work; formally it
amounts to working with morphisms between graphs which take edges to paths,
and is detailed in section 5.2 below.

The graph which appeared as the end result of the above simplification can be 
simplified further as in the following diagram. Here the two stations in London: 
Euston and Paddington have been amalgamated, as have the two routes from 
London to Birmingham. Reading station has also been omitted. 



[Figure: a further simplification in which Euston and Paddington are
amalgamated and Reading is omitted.]

We now have two successive simplifications involving five graphs altogether as
follows: 



[Diagram: the five graphs, with A and B drawn above G, H and K.]



and we wish to express this as a single simplification. 

The basic idea is to note that from A to B we have an amalgamation (to
H) followed by a selection (from H). It is possible to interchange these, so that
we can obtain B from A by first performing a selection (Select-3) and then
an amalgamation (Amalgamate-3). A formal description of this construction is
provided in section 5.4 below, but here we provide a diagram showing the result
for our specific example.



[Figure: the result of interchanging the amalgamation and the selection for
the example, showing Select-3, Amalgamate-3 and Amalgamate-2.]



By composing Select-3 with Select-1, and by composing Amalgamate-3 with
Amalgamate-2 we obtain a single simplification:



[Figure: the single simplification from the most detailed graph to the
final graph.]



This case study has demonstrated the main points of our technique, but has not 
included the full details necessary to produce an implementation. A full account 
of the technical details is included in section 5 below.



5 Technical Details

To model structures and simplifications between them, including composition 
of successive simplifications, it is appropriate to use the mathematical structure 
known as a category [BW95]. Categories have already been used in the context
of multi-resolution spatial data models by Bertolotto [Ber98]. However, unlike
Bertolotto, our treatment is based on the category Graph*, which is described
in section 5.2 below. In order to present this material, some basic facts about
graphs are needed first. 

5.1 The Category Graph 

The graphs used in this paper are undirected, are permitted to have loops, and 
may have multiple edges between the same pair of nodes. Much of the work in the 
paper can be carried out for directed graphs, but some aspects become slightly 
more complicated, while other aspects are easier to deal with. Limitations on 
space prevent us from giving details of both the undirected and directed cases. 

The set of all subsets of a set N having either one or two elements is
denoted by $P_2 N = \{\{x, y\} \mid x, y \in N\}$. A graph is then described formally
as a pair of sets E and N (of edges and nodes respectively), together with an
incidence function $i : E \to P_2 N$.

A graph morphism $f : (E_1, i_1, N_1) \to (E_2, i_2, N_2)$ is a pair of functions
$f_E : E_1 \to E_2$ and $f_N : N_1 \to N_2$, such that if the ends of edge $e \in E_1$ are x and
y, then the ends of $f_E e \in E_2$ are $f_N x$ and $f_N y$. These morphisms take edges to
edges in a way which preserves the incidence function. Given a graph morphism
f, the two functions $f_E$ and $f_N$ are referred to as the edge part and the node
part of the morphism respectively. The category Graph has graphs as objects,
and graph morphisms as its morphisms.
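
The definitions of graph and graph morphism translate readily into a small check. The sketch below makes some assumptions of its own: a graph is a pair `(nodes, ends)` in which `ends` plays the role of the incidence function and maps each edge name to a frozenset of one or two nodes, and the graphs `G`, `H` and the maps `f_N`, `f_E` are hypothetical examples.

```python
def is_graph(nodes, ends):
    """Each edge must have one end (a loop) or two ends, all of them nodes."""
    return all(1 <= len(s) <= 2 and s <= nodes for s in ends.values())


def is_morphism(f_N, f_E, G, H):
    """Check that the pair (f_E, f_N) preserves incidence from G to H."""
    (G_nodes, G_ends), (H_nodes, H_ends) = G, H
    nodes_ok = all(f_N.get(x) in H_nodes for x in G_nodes)
    edges_ok = all(
        H_ends.get(f_E.get(e)) == frozenset(f_N.get(x) for x in s)
        for e, s in G_ends.items()
    )
    return nodes_ok and edges_ok


# G is a path m - n - p with edges a and b.
G = ({"m", "n", "p"}, {"a": frozenset({"m", "n"}), "b": frozenset({"n", "p"})})
# H has an edge c joining u and v, and a loop d on v.
H = ({"u", "v"}, {"c": frozenset({"u", "v"}), "d": frozenset({"v"})})

f_N = {"m": "u", "n": "v", "p": "v"}
f_E = {"a": "c", "b": "d"}        # b must go to the loop d, since n and p both go to v
bad_f_E = {"a": "c", "b": "c"}    # not a morphism: c has ends u and v, not just v
```

Because loops and multiple edges are permitted, the edge b, whose two ends are identified by the node part, can only be sent to a loop.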

5.2 The Category Graph* 

To define simplifications of graphs, we need another category which has the 
same objects as Graph, but more general morphisms. Given a graph G define 
the graph G* to have the same nodes as G, and as edges, the set of all paths in 
G. A path in G can be described by a sequence, of nodes and edges of the form 

$$[x_0, e_1, x_1, e_2, \ldots, e_\ell, x_\ell] \tag{1}$$

where edge $e_k$ has ends $x_{k-1}$ and $x_k$. The case of $\ell = 0$ is allowed, and gives zero
length paths which are loops on each node. Two different sequences represent
the same path iff each is the reverse of the other. The ends of the path are $x_0$
and $x_\ell$.

Sometimes it is appropriate to write a path simply as $[e_1, e_2, \ldots, e_\ell]$, but in
general this can be ambiguous. For example, consider the following graph with
two nodes, m and n, and two edges, a and b.

[Figure: the nodes m and n joined by the two edges a and b.]

The paths [m, a, n, b, m] and [n, a, m, b, n] are quite distinct, and simply using
the sequence of edges [a, b] in this context would be ambiguous.
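
The description of paths as alternating sequences of nodes and edges, identified up to reversal, can be sketched as follows. The tuple encoding and the function names `is_path` and `same_path` are our own assumptions. The example is the two-node graph just discussed.

```python
def is_path(nodes, ends, p):
    """p alternates nodes and edges: (x0, e1, x1, ..., el, xl)."""
    if len(p) % 2 == 0:
        return False
    xs, es = p[0::2], p[1::2]
    return all(x in nodes for x in xs) and all(
        ends.get(e) == frozenset({xs[k], xs[k + 1]}) for k, e in enumerate(es)
    )


def same_path(p, q):
    """Two sequences denote the same path iff they are equal or mutually reversed."""
    return p == q or p == q[::-1]


nodes = {"m", "n"}
ends = {"a": frozenset({"m", "n"}), "b": frozenset({"m", "n"})}

p1 = ("m", "a", "n", "b", "m")
p2 = ("n", "a", "m", "b", "n")
```

Both `p1` and `p2` have the edge sequence [a, b], yet `same_path(p1, p2)` is false, which is the ambiguity noted above.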

The * construction is applicable not only to graphs, but also to morphisms.
Given any graph morphism $f : G \to H$ we can construct a graph morphism
$f^* : G^* \to H^*$. The morphism $f^*$ has the same effect as $f$ on nodes, and takes
the edge (1) above to $[f x_0, f e_1, f x_1, f e_2, \ldots, f e_\ell, f x_\ell]$.
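
A minimal sketch of the action of $f^*$ on paths, assuming paths are encoded as tuples alternating nodes and edges and a morphism is given by two dictionaries. The names `star`, `f_N` and `f_E` and the example morphism are hypothetical.

```python
def star(f_N, f_E):
    """Return the action of f* on paths written as (x0, e1, x1, ..., el, xl)."""
    def f_star(path):
        # Even positions hold nodes, odd positions hold edges.
        return tuple(
            f_N[s] if k % 2 == 0 else f_E[s] for k, s in enumerate(path)
        )
    return f_star


# Hypothetical morphism from the path m - n - p (edges a, b) to a graph with
# an edge c joining u and v and a loop d on v.
f_N = {"m": "u", "n": "v", "p": "v"}
f_E = {"a": "c", "b": "d"}
f_star = star(f_N, f_E)

image = f_star(("m", "a", "n", "b", "p"))
```

Since $f^*$ acts position by position, it commutes with reversal, so it is well defined on paths regarded as undirected.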

The graph G can always be embedded in $G^*$ by the Graph morphism
$\eta_G : G \to G^*$ which takes each node to itself, and an edge e to the path [e] of length
1. Repeating the * construction leads to a graph $(G^*)^*$. This has the same nodes
as G and $G^*$, but the edges are paths of paths of G, which have the form

$$[x_0, [x_0, \sigma_1, x_1], x_1, [x_1, \sigma_2, x_2], \ldots, x_{n-1}, [x_{n-1}, \sigma_n, x_n], x_n] \tag{2}$$

where each $\sigma_k$ is a sequence of edges and nodes of G, of the form

$$e_1, y_1, e_2, \ldots, e_{m-1}, y_{m-1}, e_m.$$

It is possible to reduce, or ‘flatten’, an edge in $(G^*)^*$ to one in $G^*$ by mapping
the edge (2) above to $[x_0, \sigma_1, x_1, \sigma_2, x_2, \ldots, x_{n-1}, \sigma_n, x_n]$. This assignment gives
the edge part of a morphism $\mathrm{flat}_G : (G^*)^* \to G^*$ which is the identity on nodes.
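
The flattening operation can be sketched as a function on nested tuples. This is an illustration under assumptions of our own: a path of paths is a tuple alternating nodes and inner paths, and each inner path is stored oriented from the preceding node to the following node, as it is written in (2).

```python
def flatten(pp):
    """Flatten an edge of (G*)*, written as in (2), to a path of G.

    pp alternates nodes and inner paths: (x0, P1, x1, ..., Pn, xn), where
    each inner path Pk is a tuple running from x(k-1) to xk.
    """
    out = [pp[0]]
    for k in range(1, len(pp), 2):
        inner, x_k = pp[k], pp[k + 1]
        if inner[0] != out[-1] or inner[-1] != x_k:
            raise ValueError("inner path does not join its two end nodes")
        out.extend(inner[1:])  # skip the node shared with the previous step
    return tuple(out)


# Two different paths of paths over a graph m - n - p with edges a and b.
two_steps = ("m", ("m", "a", "n"), "n", ("n", "b", "p"), "p")
one_step = ("m", ("m", "a", "n", "b", "p"), "p")
```

Both nested paths flatten to the same path of G, which shows that flattening is not injective in general.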

The * construction allows us to define the category Graph*. This has the
same objects as Graph, but morphisms from G to H in Graph* are ordinary
graph morphisms from G to $H^*$. Given morphisms $f : G \to H$ and $g : H \to K$
in Graph*, their composition is given by

$$G \xrightarrow{\;f\;} H^* \xrightarrow{\;g^* ;\, \mathrm{flat}_K\;} K^*$$
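
Composition in Graph* can be sketched by applying $g^*$ to the image of each edge under f and flattening the result. The encoding is an assumption of ours: a morphism is a pair of dictionaries, the edge part sends each edge to a path stored as a tuple, and a stored path is traversed backwards when its first node does not match the current position. The sketch does not resolve the orientation of images that are loops, and the example graphs are hypothetical.

```python
def compose(f, g):
    """Compose f : G -> H and g : H -> K in Graph*.

    Each morphism is a pair (node_map, edge_map); edge_map sends an edge to a
    path written as a tuple (x0, e1, x1, ..., el, xl).
    """
    f_N, f_E = f
    g_N, g_E = g

    def push(path):
        # Apply g* to a path of H and flatten the result to a path of K.
        out = [g_N[path[0]]]
        for k in range(1, len(path), 2):
            inner = g_E[path[k]]
            if inner[0] != out[-1]:   # stored the other way round: reverse it
                inner = inner[::-1]
            out.extend(inner[1:])
        return tuple(out)

    h_N = {x: g_N[f_N[x]] for x in f_N}
    h_E = {e: push(p) for e, p in f_E.items()}
    return h_N, h_E


# G has one edge e joining s and t; H is the path m - n - p with edges a, b;
# K is the path u - v - w - z with edges c, d, r.
f = ({"s": "m", "t": "p"}, {"e": ("m", "a", "n", "b", "p")})
g = (
    {"m": "u", "n": "w", "p": "z"},
    {"a": ("u", "c", "v", "d", "w"), "b": ("z", "r", "w")},
)
h_N, h_E = compose(f, g)

# The embedding of H into H*, which acts as an identity for this composition.
eta_H = (
    {"m": "m", "n": "n", "p": "p"},
    {"a": ("m", "a", "n"), "b": ("n", "b", "p")},
)
```

In this example the single edge e of G ends up mapped to the whole path from u to z in K.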



5.3 Selection and Amalgamation for Graphs 



Definition 1 A selection from a graph G is a subgraph A of $G^*$ such that for
each path $\pi$ in G, there is at most one path $\pi'$ in A where flattening $\pi'$ yields $\pi$.

The following example should help to clarify this definition. 
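[Figure: the graph G, with nodes m, n, p and edges a (joining m and n) and
b (joining n and p), and the graph H, with the same nodes and the edges
[m, a, n], [n, b, p] and [m, a, n, b, p].]

In the above diagram we have a graph, G, and a graph H. The graph H is a
subgraph of $G^*$, but is not a selection from G. This is because the two paths
in H, [m, [m, a, n], n, [n, b, p], p] and [m, [m, a, n, b, p], p], both flatten to the same
path [m, a, n, b, p] in G.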

If A is a selection from G and C is a selection from A, one might hope
that C would be a selection from G. However, this does not happen, as can
be seen from a simple example. Let G be the graph with three edges a, b, c and
four nodes m, n, p, q, in which a joins m and n, b joins n and p, and c joins p
and q, and let A be the selection from G with nodes m, p, q and the two edges
[m, a, n, b, p] and [p, c, q]. The graph C with nodes m and q and the single edge
[m, [m, a, n, b, p], p, [p, c, q], q] is a selection from A, but not a selection
from G, since it is a selection from $G^*$.
This failure of selections to compose is overcome by noting that applying $\mathrm{flat}_G$
to C yields a selection from G which is isomorphic to C itself. For our specific
example, $\mathrm{flat}_G C$ is the graph with nodes m and q and the single edge
[m, a, n, b, p, c, q].



Definition 2 An amalgamation for a graph G is a Graph morphism
$\alpha : G \to H^*$, such that the node part of the morphism, $\alpha_N$, is surjective, and
for every edge e of H there is some edge e' of G for which $\alpha_E e' = [e]$, where
$\alpha_E$ is the edge part of the morphism.
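
The three conditions of Definition 2 can be checked mechanically for small graphs. The sketch below assumes graphs are pairs `(nodes, ends)` with `ends` mapping edge names to frozensets of end nodes, and paths are tuples alternating nodes and edges. The edge names e1, e2, e3 and l, and the function names, are hypothetical. The example is the Heathrow amalgamation of section 3.3.

```python
def is_path(nodes, ends, p):
    """p alternates nodes and edges: (x0, e1, x1, ..., el, xl)."""
    if len(p) % 2 == 0:
        return False
    xs, es = p[0::2], p[1::2]
    return all(x in nodes for x in xs) and all(
        ends.get(e) == frozenset({xs[k], xs[k + 1]}) for k, e in enumerate(es)
    )


def is_amalgamation(a_N, a_E, G, H):
    """Check the conditions of Definition 2 for alpha = (a_E, a_N)."""
    (G_nodes, G_ends), (H_nodes, H_ends) = G, H
    # alpha is a morphism from G to H*: each edge goes to a path of H whose
    # ends are the images of the ends of the edge.
    morphism = all(
        is_path(H_nodes, H_ends, a_E[e])
        and frozenset({a_E[e][0], a_E[e][-1]}) == frozenset(a_N[x] for x in s)
        for e, s in G_ends.items()
    )
    onto_nodes = {a_N[x] for x in G_nodes} == H_nodes
    # Every edge e of H is the image [e] of some edge of G.
    onto_edges = all(
        any(len(p) == 3 and p[1] == e for p in a_E.values()) for e in H_ends
    )
    return morphism and onto_nodes and onto_edges


Y = (
    {"Heathrow 4", "Heathrow 1-3", "Piccadilly Circus", "Leicester Square"},
    {
        "e1": frozenset({"Heathrow 4", "Heathrow 1-3"}),
        "e2": frozenset({"Heathrow 1-3", "Piccadilly Circus"}),
        "e3": frozenset({"Piccadilly Circus", "Leicester Square"}),
    },
)
X = ({"Heathrow", "London"}, {"l": frozenset({"Heathrow", "London"})})

a_N = {
    "Heathrow 4": "Heathrow",
    "Heathrow 1-3": "Heathrow",
    "Piccadilly Circus": "London",
    "Leicester Square": "London",
}
a_E = {
    "e1": ("Heathrow",),                # collapses to a zero length path
    "e2": ("Heathrow", "l", "London"),  # the path [l] of length 1
    "e3": ("London",),                  # collapses to a zero length path
}
```

The edges e1 and e3 are sent to zero length paths, which is how the edge between the two Heathrow stations disappears in X.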



Definition 3 A simplification from a graph G to a graph H is a pair $(A, \alpha)$
where A is a selection from G, and $\alpha : A \to H^*$ is an amalgamation of A.



5.4 A Construction for Composing Simplifications 



If we have two successive simplifications of graphs $G \xrightarrow{(A, \alpha)} H \xrightarrow{(B, \beta)} K$ we
need to be able to compose them to give a simplification $G \xrightarrow{(A, \alpha) ; (B, \beta)} K$.
This is done by first constructing the graph C, which is the largest subgraph, X,
of $A^*$ such that $\alpha^* X = B$. By restricting $\alpha^*$ to C, we obtain a Graph morphism
$\delta : C \to B^*$, so that $\delta$ is an amalgamation of C.

Now C is a subgraph of $A^*$, and hence of $(G^*)^*$, whereas for a simplification
from G to K, we need a subgraph of $G^*$. By applying the construction
for selections of selections above, we get $\mathrm{flat}_G C$ as a selection from G, and an
isomorphism $\varphi : \mathrm{flat}_G C \to C$. Finally we get the definition of the composite
simplification

$$(A, \alpha) ; (B, \beta) = (\mathrm{flat}_G C,\ \varphi ; \delta ; \beta),$$

where $\varphi ; \delta ; \beta$ denotes the composite of these three morphisms in Graph*. The
overall picture is seen in this diagram in the category Graph*:

[Diagram: the composition of the simplifications $(A, \alpha)$ and $(B, \beta)$ in Graph*.]




With this method of composing simplifications, we have a category where the 
objects are graphs, and the morphisms are simplifications. 



6 Conclusions and Further Work

The work described in this paper has concerned explication of and distinction 
between the generalization operations that select and amalgamate data objects 
in a data model of a spatial phenomenon. In previous work, the functions of 
these two conceptually quite distinct operations have been conflated. We have 
termed such operations simplifications, and have provided a formal treatment 
of simplification in the case where the underlying data structure is a graph. In 
particular, we have shown that the appropriate formal home for these structures 
is the category Graph*, and in that context it is possible to provide a construc- 
tion for composition of simplification operations. We justified concentration on 
graph data structures because they were simple enough to show up clearly the 
main structural features of our approach, while at the same time being useful in 
spatial information handling with a history of treatment by SIS researchers. 

Our approach is limited in several ways. Firstly, the simplification operators 
considered are in no way claimed to be a complete set of generalization opera- 
tions, and further work is required to incorporate into this framework a richer 
collection of generalization operations. Secondly, the graph data structures are 
one-dimensional. The next step in the work is to consider simplification oper- 
ators in fully two-dimensional data structures, in particular 2-complexes. The 
technical details for this case are more difficult. For example, the notion of a
path of edges in a graph must be generalized to a gluing together of faces in a 
2-complex. Work in this direction will be reported in a future publication. 

There is one particular construction in which graph data structures have 
immediate application to fully two-dimensional data, and that is to areal de- 
compositions of the plane. The dual graph of such a decomposition is a graph, 
where areas in the decomposition are nodes of the graph and two nodes are 
connected by an edge in the graph if the corresponding areas are adjacent in the 
decomposition. Some forms of generalization of an areal decomposition can be 
viewed as simplifications of its dual graph. For example, merging of two adjacent 
areas in an areal decomposition is equivalent in the dual graph to amalgamation 
of two nodes and elimination of the edge between them. One direction for further 
work using dual graphs would be to develop the boundary sensitive approach 
to qualitative location [BS98, Ste99] in a multi-resolution context. A detailed
exploration of generalizations of areal decompositions of the plane in terms of 
simplification of the dual graph will be reported in later publications. 
Abstract. The Simplicial Multi-Complex (SMC) is a general multiresolution
model for representing k-dimensional spatial objects through simplicial
complexes. An SMC integrates several alternative representations
of an object and offers simple methods for handling representations at 
variable resolution efficiently, thus providing a basis for the development 
of applications that need to manage the level-of-detail of complex ob- 
jects. In this paper, we present general query operations on such models, 
we describe and classify alternative data structures for encoding an SMC, 
and we discuss the cost and performance of such structures. 



1 Introduction 

Geometric cell complexes (meshes) have a well-established role as discrete mod- 
els of continuous domains and spatial objects in a variety of application fields, 
including Geographic Information Systems (GISs), Computer Aided Design,
virtual reality, scientific visualization, etc. In particular, simplicial complexes (e.g.,
triangle and tetrahedra meshes) offer advantageous features such as adaptivity 
to the shape of the entity, and ease of manipulation. 

The accuracy of the representation achieved by a discrete geometric model
depends on its resolution, i.e., on the relative size and number of its cells.
At present, while the availability of data sets of larger and larger size allows
building models at higher and higher resolution, the computing power and
transmission bandwidth of networks are still insufficient to manage such
models at their full resolution. The need to trade off between accuracy of
representation, and time and space constraints imposed by the applications has
motivated a burst of research on Level-of-Detail (LOD). The general idea behind
LOD can be summarized as: always use the best resolution you need - or you
can afford - and never use more than that. In order to apply this principle,
a mechanism is necessary that can “administrate” resolution by adapting a
mesh to the needs of an application, possibly varying its resolution over different
areas of the entity represented.

A number of different LOD models have been proposed in the literature. 
Most of them have been developed for applications to terrain modeling in GISs 
(see, for instance, mil 101) and to surface representation in computer graphics 
and virtual reality applications (see, for instance, [130113 0), and they are 
strongly characterized by the data structures and optimization techniques they 
adopt as well as custom tailored to perform specific operations, and to work
on specific architectures. In this scenario, developers who would like to include 
LOD features in their applications are forced to implement their own models 
and mechanisms. On the other hand, a wide range of potential applications for 
LOD have been devised, which require a common basis of operations (see, e.g., 
0). Therefore, it seems desirable that the LOD technology is brought to a more 
mature stage, which allows developers to use it through a common interface, 
without the need to care about many details. 

In our previous work, we have developed a general model, called a Simplicial 
Multi-Complex (SMC), that can capture all LOD models based on simplicial 
complexes as special cases cniEiiii. Based on such model, we have built systems 
for managing the level of detail in terrains [2], and in free-form surfaces [3], and
we are currently developing an application in volume visualization. 

In this paper, we consider general operations that can be performed on LOD 
models and propose an analysis of cost and performance of their encoding data 
structures. Trade-off between cost and performance is a key issue to make the 
LOD technology suitable to a wide spectrum of applications and architectures 
in order to achieve a more homogeneous and user-transparent use of LOD. 

The Simplicial Multi-Complex is briefly described in Section 2, and general
query techniques on such a model are outlined in Section 3. In Section 4 we
analyze the spatial relations among entities in the SMC, which are fundamental
to support queries and traversal algorithms. In Section 5 we analyze different
data structures to encode SMCs in the general case, as well as in special cases,
and we discuss both the cost of such data structures, and their performance
in supporting the extraction of spatial relations. In Section 6 we present some
concluding remarks.



2 Simplicial Multi-complexes 

In this section, we briefly review the main concepts about the Simplicial Multi- 
Complex, a dimension-independent multiresolution simplicial model which ex- 
tends the Multi- Triangulation presented in 0 . For the sake of brevity, 

this subject is treated informally here. For a formal treatment and details see 

m- ^ 

In the remainder of the paper, we denote with $k$ and $d$ two integer numbers
such that $0 < k \le d$. A $k$-dimensional simplex $\sigma$ is the locus of points that
can be expressed as the convex combination of $k + 1$ affinely independent points
in $\mathbb{E}^d$, called the vertices of $\sigma$. Any simplex with vertices at a subset of the
vertices of $\sigma$ is called a facet of $\sigma$. A (regular) $k$-dimensional simplicial complex
in $\mathbb{E}^d$ is a finite set $\Sigma$ of $k$-simplices such that, for any pair of distinct simplices
$\sigma_1, \sigma_2 \in \Sigma$, either $\sigma_1$ and $\sigma_2$ are disjoint, or their intersection is the set of facets
shared by $\sigma_1$ and $\sigma_2$. In what follows, a $k$-simplex will always be called a cell,
and we will deal only with complexes whose domain is a manifold (also called
subdivided manifolds).
The intuitive idea behind a Simplicial Multi-Complex (SMC) is the following:
consider a process that starts with a coarse simplicial complex and progressively
refines it by performing a sequence of local updates (see Figure 1). Each local
update replaces a group of cells with another group of cells at higher resolution.
An update $U_2$ in the sequence directly depends on another update $U_1$ preceding it
if $U_2$ removes some cells introduced with $U_1$. The dependency relation between
updates is defined as the transitive closure of the direct dependency relation.
Only updates that depend on each other need to be performed in the given
order; mutually independent updates can be performed in arbitrary order. For
instance, in the example of Figure 1, updates 3 and 4 are mutually independent,
while update 5 depends on both; thus, we may perform update 4 first,
followed by 3 and then 5.
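
As an illustration of the dependency relation, the following Python sketch derives direct dependencies from the cells each update removes and creates, and then lists the updates in one order that respects them. The update table is a toy example made up for this sketch (it is not the data of Figure 1), and the names `UPDATES`, `direct_dependencies` and `consistent_order` are assumptions of the sketch.

```python
from graphlib import TopologicalSorter

# Toy data (hypothetical): update id -> (cells removed, cells created).
# "t0" stands for the single cell of the initial coarse mesh.
UPDATES = {
    1: ({"t0"}, {"t1", "t2", "t3"}),
    2: ({"t1"}, {"t4", "t5"}),
    3: ({"t2"}, {"t6", "t7"}),
    4: ({"t3"}, {"t8", "t9"}),
    5: ({"t6", "t8"}, {"t10", "t11", "t12"}),
}

def direct_dependencies(updates):
    """Map each update to the set of updates it directly depends on."""
    creator = {}
    for uid, (_, created) in updates.items():
        for cell in created:
            creator[cell] = uid
    # An update depends on the creator of every cell it removes;
    # cells of the initial mesh have no creator and are skipped.
    return {
        uid: {creator[c] for c in removed if c in creator}
        for uid, (removed, _) in updates.items()
    }

def consistent_order(updates):
    """Return one sequence of updates in which dependencies come first."""
    return list(TopologicalSorter(direct_dependencies(updates)).static_order())
```

In this toy table, updates 3 and 4 are mutually independent and update 5 depends on both, so any order returned places 3 and 4 before 5 but may list them either way round.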




Fig. 1. A sequence of five updates (numbered 1 to 5) progressively refining an
initial coarse triangle mesh. The area affected by each update is shaded. 



An SMC abstracts from the totally ordered sequence by encoding a partial 
order describing the mutual dependencies between pairs of updates. Updates 
forming any subset closed with respect to the partial order, when performed in 
a consistent sequence, generate a valid simplicial complex. Thus, it is possible 
to perform more updates in some areas, and fewer updates elsewhere, hence 
obtaining a complex whose resolution is variable in space. Such an operation is 
known as selective refinement, and it is at the basis of LOD management. A few 
results of selective refinement from an SMC representing a terrain are shown in 
Figure 2.
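
A subset of updates is closed with respect to the partial order when it contains, for each of its updates, everything that update depends on. A minimal sketch of how an arbitrary selection could be completed to such a closed subset is given below; the dependency table `DEPS` is a hypothetical toy example and `closure` is a name chosen for this sketch.

```python
# Hypothetical direct dependencies: update id -> updates it directly depends on.
DEPS = {1: set(), 2: {1}, 3: {1}, 4: {1}, 5: {3, 4}}

def closure(selected, deps):
    """Smallest superset of `selected` that is closed under dependencies."""
    closed, stack = set(), list(selected)
    while stack:
        u = stack.pop()
        if u not in closed:
            closed.add(u)
            stack.extend(deps[u])  # follow direct dependencies transitively
    return closed
```

For instance, selecting only update 5 forces updates 3, 4 and 1 to be performed as well, while update 2 can be left out, which is what makes the resolution of the result variable in space.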

Fig. 2. Three meshes extracted from a two-dimensional SMC representing a
terrain (top view). (a) The triangulation has the highest possible resolution
inside a rectangular window, and the lowest possible resolution outside it. (b)
Resolution inside a view frustum (wedge) is decreasing with the distance from
its focus point, while it is arbitrarily low outside it. (c) Resolution is high only
in the proximity of a polyline.

An SMC is described by a directed acyclic graph (DAG). Each update is
a node of the DAG, while the arcs correspond to direct dependencies between
updates. Each arc is labeled with the collection of all cells of its source node that
are removed by its destination node. For convenience, we introduce two further
nodes: a root, corresponding to the update creating the initial coarse complex,
which is connected with an arc to each update that removes some of its cells;
and a drain, corresponding to the final deletion of the complex obtained by
performing all updates, which is connected with an arc from each update that
creates some of its cells. Such arcs are also labeled by cells in a consistent way.
Figure 3 shows the SMC corresponding to the collection of updates described in
Figure 1.

A front of an SMC is a set of arcs containing exactly one arc on each directed 
path from the root (see Figure 3). Since the DAG encodes a partial order, we say
that a node is before a front if it can be reached from the root without traversing 
any arc of the front; otherwise the node is said to be after the front. Nodes lying 
before a front define a consistent set of updates, and the corresponding simplicial 
complex is formed by all cells labeling the arcs of the front HU. By sweeping a 
front through the DAG, we obtain a wide range of complexes, each characterized 
by a different resolution, possibly variable in space. 
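
The relation between a front and its complex can be sketched in a few lines of Python. The sketch assumes that the DAG is given as a list of labeled arcs and that the set of nodes lying before the front is known and closed under dependencies; the arc table `ARCS` is a hypothetical toy example, and `front_mesh` is a name chosen here.

```python
# Hypothetical labeled DAG: (source node, destination node, cells on the arc).
ARCS = [
    ("root", 1, {"t0"}),
    (1, 2, {"t1"}), (1, 3, {"t2"}), (1, 4, {"t3"}),
    (2, "drain", {"t4", "t5"}),
    (3, 5, {"t6"}), (3, "drain", {"t7"}),
    (4, 5, {"t8"}), (4, "drain", {"t9"}),
    (5, "drain", {"t10", "t11", "t12"}),
]

def front_mesh(arcs, before):
    """Cells labeling the arcs that leave the set of nodes before the front."""
    mesh = set()
    for src, dst, cells in arcs:
        # An arc belongs to the front when it crosses from "before" to "after".
        if src in before and dst not in before:
            mesh |= cells
    return mesh
```

With `before = {"root", 1, 3}` the front consists of the arcs leaving nodes 1 and 3 towards nodes that have not been performed, and the resulting mesh is refined only where update 3 acted.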

In applications, an SMC is often enriched with attribute information associated
with its cells. Examples are approximation errors (measuring the distance
of a cell from the object portion it approximates), colors, material properties,
etc.

3 A Fundamental Query on an SMC 

Since an SMC provides several descriptions of a spatial object, a basic query 
operation consists of selecting a complex which represents the object according 
to some user-defined resolution requirements. This basic query provides a natural 
support to variable resolution in many operations, such as: 
— point location, i.e., finding the cell that contains a given point and such that
its resolution meets some user-defined requirements;

— windowing, i.e., finding a complex that represents the portion of the object
lying inside a box, at a user-defined resolution;

— ray casting, i.e., finding the cells that intersect a given ray at a user-defined
resolution;

— perspective rendering: in this case, a complex is generated which represents
the portion of the object lying inside the view frustum, and whose resolution
is higher near the viewpoint and decreases with the distance from it;

— cut, i.e., sectioning with a hyperplane: the section is computed by first
retrieving the cells that intersect the given hyperplane and have a specific
resolution.

Fig. 3. (a) The SMC built over the partially ordered set of mesh updates of
Figure 1. Each node represents an update, and it shows the two sets of simplices
removed and created in the update. Each arc represents the dependency between
two updates, and it is labelled by the triangles created in the first update, which
are removed in the second update. A front on the SMC contains the arcs
intersected by the thick dashed line; nodes lying before the front are highlighted. (b)
The triangle mesh associated with the front.

In the most general case, resolution requirements are expressed through a
resolution filter, which is a user-defined function $R$ that assigns to each cell $\sigma$ of
the SMC a real value $R(\sigma)$. Intuitively, a resolution filter measures the “signed
difference” between the resolution of a cell and that required by the application:
$R(\sigma) > 0$ means that the resolution of $\sigma$ is not sufficient; $R(\sigma) < 0$ means that
the resolution of $\sigma$ is higher than necessary. A cell such that $R(\sigma) \le 0$ is said
to be feasible.
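
To make the notion concrete, the sketch below builds one possible resolution filter for a windowing query on a two-dimensional SMC. It assumes that each cell carries an approximation error as an attribute, uses a bounding-box test as a conservative stand-in for an exact cell/window intersection, and treats a cell as feasible when the filter value is not positive. The class `Cell` and the functions `window_filter` and `feasible` are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    vertices: tuple   # ((x, y), (x, y), (x, y)) for a triangle
    error: float      # approximation error attached to the cell

def window_filter(window, tolerance):
    """Build R: negative outside the window, error minus tolerance inside."""
    xmin, ymin, xmax, ymax = window

    def R(cell):
        xs = [x for x, _ in cell.vertices]
        ys = [y for _, y in cell.vertices]
        # Conservative test on the bounding box of the cell.
        overlaps = (min(xs) <= xmax and max(xs) >= xmin and
                    min(ys) <= ymax and max(ys) >= ymin)
        if not overlaps:
            return -1.0               # any resolution is acceptable out here
        return cell.error - tolerance  # signed difference inside the window

    return R

def feasible(cell, R):
    return R(cell) <= 0

R = window_filter((0, 0, 10, 10), tolerance=1.0)
inside_coarse = Cell(((1, 1), (4, 1), (1, 4)), error=5.0)
inside_fine = Cell(((1, 1), (2, 1), (1, 2)), error=0.5)
outside = Cell(((20, 20), (24, 20), (20, 24)), error=5.0)
```

Here `inside_coarse` is not feasible because its error exceeds the tolerance, whereas `inside_fine` and `outside` are both feasible.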

For example, the meshes depicted in Figure 2 satisfy the following resolution
filters: in (a), $R$ is negative for all cells outside the window, zero for all cells
inside it that are at the highest resolution, and positive for all others; in (b), $R$
is negative for all cells outside the view frustum, while for a cell $\sigma$ inside it, $R$
is decreasing with the resolution of $\sigma$, and with its distance from the focus point;
in (c), $R$ is negative for all cells not intersecting the polyline, zero for all cells
intersecting it that are at the highest resolution, and positive for all others.

The basic query on an SMC consists of retrieving the simplicial complex of 
minimum size (i.e., composed of the smallest number of cells) which satisfies
a given resolution filter R (i.e., such that all its cells are feasible with respect 
to R). Variants of this query are also described in JJ. The basic query can be 
easily combined with a culling mechanism, which extracts only the subcomplex 
intersecting a given Region Of Interest (ROI). This localized query makes it
possible to implement operations like point location, windowing, etc.

Algorithms for mesh extraction H31 0 d CH sweep a front through the 
DAG, until an associated complex formed by feasible cells is found. Minimum 
size is guaranteed by a front that lies as close as possible to the root of the SMC. 
In the case of a localized query, spatial culling based on a ROI is incorporated 
in the DAG traversal, hence using the structure of the SMC also as a sort of 
spatial index. The key operations used by extraction algorithms consist in either 
advancing the front after a node, when the resolution of the complex over that 
area is not sufficient, or moving it before a node when the resolution over that 
area is higher than required. 
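
A deliberately simple top-down variant of such an extraction can be sketched as follows. It is not the algorithm of the cited works: it only advances the front (never moves it back), recomputes the front mesh from scratch at every step, and ignores spatial culling. The arc table `ARCS`, the error table `ERR` and the function `extract` are hypothetical names and data introduced for this sketch.

```python
# Hypothetical labeled DAG: (source node, destination node, cells on the arc).
ARCS = [
    ("root", 1, {"t0"}),
    (1, 2, {"t1"}), (1, 3, {"t2"}), (1, 4, {"t3"}),
    (2, "drain", {"t4", "t5"}),
    (3, 5, {"t6"}), (3, "drain", {"t7"}),
    (4, 5, {"t8"}), (4, "drain", {"t9"}),
    (5, "drain", {"t10", "t11", "t12"}),
]

def extract(arcs, R, root="root", drain="drain"):
    """Advance the front from the root until every cell on it is feasible."""
    parents, remover = {}, {}
    for src, dst, cells in arcs:
        parents.setdefault(dst, set()).add(src)
        for c in cells:
            remover[c] = dst          # node whose update removes cell c
    before = {root}
    while True:
        mesh = set()
        for src, dst, cells in arcs:
            if src in before and dst not in before:
                mesh |= cells
        # Cells removed only by the drain cannot be refined any further.
        bad = [c for c in mesh if R(c) > 0 and remover[c] != drain]
        if not bad:
            return mesh
        # Advance the front after the node refining one infeasible cell,
        # together with every node that node depends on.
        stack = [remover[bad[0]]]
        while stack:
            u = stack.pop()
            if u not in before:
                before.add(u)
                stack.extend(parents.get(u, ()))

# Hypothetical per-cell errors; cells not listed have error 0.
ERR = {"t0": 10, "t1": 1, "t2": 8, "t3": 1}
result = extract(ARCS, lambda c: ERR.get(c, 0) - 2)
```

Because nodes are added only when some cell on the front is infeasible, the front stays close to the root; in the toy data only updates 1 and 3 are performed.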

The key issues that have impact on the performance of such algorithms are: 
the evaluation of the resolution function, which is application-dependent; and 
the evaluation of mutual relations that occur among different entities of the 
SMC. The cost of computing such relations is highly dependent on the amount 
of information stored in the data structure. 



4 Relations in a Simplicial Multi-complex 

In some applications, e.g., in computer graphics, it is often sufficient to represent 
a simplicial complex by the collection of its cells, where each cell is described 
by its vertices and its attributes. In other applications, e.g., in GIS, in CAD, or
in scientific visualization, topological relations among vertices and cells of the
mesh must be maintained as well. A common choice is the winged data structure,
which stores, for each cell, the $(k + 1)$ cells adjacent to it along its $(k - 1)$-facets
m- Building the winged data structure for the mesh produced as the result 
of a query on the SMC can be more or less expensive, depending on the data 
structure used to encode the SMC. 

In the following, we discuss the relations among the elements of an SMC, 
which are needed in the traversal algorithms, and in building the winged data 
structure for the output mesh. 

There are essentially three kinds of relations in an SMC: 

— Relations on the DAG: they define the structure of the DAG describing the 
SMC by relating its nodes and arcs. 

— Relations between the DAG and the cells of the SMC: they define the
connections between the elements of the DAG (arcs and nodes) and the cells
forming the SMC; in the definition given in Section 2 such a connection is
defined by labeling each arc of the DAG with the cells created by its source
node that are removed by its destination node.

— Relations between the simplices of the SMC: they define the relations among
vertices and cells in the SMC.

The relations on the DAG are the standard relations in a directed graph: 
Node-Arc (NA), which associates with a node its incoming and its outgoing arcs;
and Arc-Node (AN), which associates with an arc its source and its destination. 

The relations between the DAG and the cells of the SMC can be defined as 
follows: 

— Arc-Cell (AC) relation, which associates with an arc of the DAG the
collection of the cells labeling it.

— Cell-Arc (CA) relation, which associates with a cell $\sigma$ of the SMC the arc of
the DAG whose label contains $\sigma$.

— Node-Cell (NC) relation, which associates with a node $U$ the cells created
and deleted by the corresponding update.

— Cell-Node (CN) relation, which associates with a cell $\sigma$ the node $U$
introducing $\sigma$ in its corresponding update, and the node $U'$ removing $\sigma$.

The relations between the simplices in an SMC we are interested in are:

— the relation between a cell and its vertices, that we call Cell-Vertex (CV)
relation;

— the adjacency relation between two cells that share a $(k - 1)$-facet, that
we call a Cell-Cell (CC) relation.

Since not all cells sharing a $(k - 1)$-facet in the SMC can coexist in a cell
complex extracted from it, we specialize the CC relation further into four
different relations that will be used in the context of data structures and algorithms
discussed in the following (see also Figure 4). Given two cells $\sigma_1$ and $\sigma_2$ that
share a $(k - 1)$-facet $\sigma''$:

1. $\sigma_1$ and $\sigma_2$ are co-CC$_1$ at $\sigma''$ if and only if $\sigma_1$, $\sigma_2$ have been removed by the
same update (i.e., they label either the same arc or two arcs entering the
same node);

2. $\sigma_1$ and $\sigma_2$ are co-CC$_2$ at $\sigma''$ if and only if $\sigma_1$, $\sigma_2$ have been created by the
same update (i.e., they label either the same arc or two arcs leaving the same
node);

3. $\sigma_2$ is counter-CC$_{1,2}$ to $\sigma_1$ at $\sigma''$ if and only if $\sigma_2$ is created by the update
that removes $\sigma_1$ (i.e., the arc containing $\sigma_1$ and that containing $\sigma_2$ enter and
leave the same node, respectively);

4. $\sigma_2$ is counter-CC$_{2,1}$ to $\sigma_1$ at $\sigma''$ if and only if $\sigma_2$ is removed by the update
that creates $\sigma_1$ (i.e., the arc containing $\sigma_1$ and that containing $\sigma_2$ leave and
enter the same node, respectively).

Fig. 4. A fragment of the SMC of Figure 3 and CC relations involving simplex
$\sigma$. At edge $p_1p_2$, relations co-CC$_1$ and co-CC$_2$ both give simplex $\sigma_1$; relations
counter-CC$_{1,2}$ and counter-CC$_{2,1}$ are not defined. At edge $p_2p_3$ no CC relation is
defined. At edge $p_3p_1$, relation co-CC$_1$ is not defined, relation counter-CC$_{1,2}$ gives
$\sigma_3$; relation co-CC$_2$ gives $\sigma_2$ and counter-CC$_{2,1}$ is not defined.

Relations co-CC$_1$ and counter-CC$_{1,2}$ are mutually exclusive: a $k$-simplex cannot
have both a co-CC$_1$ and a counter-CC$_{1,2}$ cell at the same $(k - 1)$-facet. The
same property holds for relations co-CC$_2$ and counter-CC$_{2,1}$. The above four
relations do not capture all possible CC relations among cells in an SMC, but
they are sufficient to support efficient reconstruction algorithms, as explained in
the following.
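
The four specialized relations can be read directly off the Cell-Node relation. The sketch below assumes that the two cells are already known to share a $(k - 1)$-facet (this is not checked) and that the creating and removing node of every cell are available as dictionaries; the tables `CREATOR` and `REMOVER` are hypothetical toy data, and `cc_relations` is a name chosen for this sketch.

```python
# Hypothetical Cell-Node data: creating and removing node of each cell.
CREATOR = {"t1": 1, "t2": 1, "t3": 1, "t6": 3, "t7": 3}
REMOVER = {"t1": 2, "t2": 3, "t3": 4, "t6": 5, "t7": "drain"}

def cc_relations(s1, s2, creator, remover):
    """Relations holding from s1 to s2, assuming they share a (k-1)-facet."""
    rels = set()
    if remover[s1] == remover[s2]:
        rels.add("co-CC1")         # removed by the same update
    if creator[s1] == creator[s2]:
        rels.add("co-CC2")         # created by the same update
    if creator[s2] == remover[s1]:
        rels.add("counter-CC1,2")  # s2 created by the update removing s1
    if remover[s2] == creator[s1]:
        rels.add("counter-CC2,1")  # s2 removed by the update creating s1
    return rels
```

For instance, with these tables `cc_relations("t2", "t6", CREATOR, REMOVER)` yields only the counter relation from the removed cell to its replacement, and swapping the arguments yields the opposite counter relation.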

Relations CV and CC, defined in the context of a mesh extracted from an
SMC by the algorithms described in Section 3, also characterize the winged data
structure. Now, let us assume that we want to encode our output mesh through
such a data structure. We have three options:

1. Adjacency reconstruction as a post-processing step: the extraction algorithm
returns just a collection of cells and vertices, together with the CV relation;
pairs of adjacent (CC) cells in the output mesh are found through a sorting
process. This takes $O(m(k + 1) \log(m(k + 1)))$ time, where $m$ is the number
of cells in the mesh, and $k$ is the dimension of the complex.

2. Incremental adjacency update: the pairs of adjacent cells in the output mesh
are determined and updated while traversing the SMC, encoded with a data
structure that maintains the four relations co-CC$_1$, co-CC$_2$, counter-CC$_{1,2}$
and counter-CC$_{2,1}$.

In the extraction algorithms, when the current front is advanced after a
node, the pairs of mutually adjacent new cells introduced in the current mesh
are determined by looking at co-CC$_2$ relations in the SMC; the adjacency
relations involving a new cell and a cell that was already present in the
output mesh are determined by using relation counter-CC$_{2,1}$ and the adjacency
relations of the old cells replaced by the update. Symmetrically, when the
current front is moved before a node, relations co-CC$_1$ and counter-CC$_{1,2}$
permit updating adjacency relations in the current mesh. The total time is
linear in the number of cells swept from one side to the other of the front.

3. Incremental adjacency reconstruction: same as approach 2, but without
encoding CC relations of the SMC.

In this case, when sweeping a front through a node (either forward or
backward), a process of adjacency reconstruction similar to that used in approach
1 is applied locally to the part of the current complex formed by the new
cells introduced in the current mesh and the cells adjacent to those deleted
by the update operation. The time required is $O(n_{sweep} \log M)$, where $n_{sweep}$
is the number of swept cells, and $M$ is the maximum number of cells removed
and created by the update contained in a swept node.
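
The sorting process behind approach 1 can be sketched as follows: every cell contributes one record per $(k - 1)$-facet, the $m(k + 1)$ records are sorted by facet, and equal facets in consecutive positions identify a pair of adjacent cells. The sketch assumes a manifold mesh (each facet shared by at most two cells) given as tuples of vertex indices; `adjacency_by_sorting` is a name chosen here.

```python
def adjacency_by_sorting(cells):
    """cells: list of tuples of k + 1 vertex indices.
    Returns a dict mapping each cell index to its adjacent cell indices."""
    records = []
    for i, cell in enumerate(cells):
        for j in range(len(cell)):
            # The facet opposite to vertex j, in canonical (sorted) form.
            facet = tuple(sorted(cell[:j] + cell[j + 1:]))
            records.append((facet, i))
    records.sort()  # m(k + 1) records
    adjacent = {i: [] for i in range(len(cells))}
    for (f1, c1), (f2, c2) in zip(records, records[1:]):
        if f1 == f2:  # the same facet appears in two cells
            adjacent[c1].append(c2)
            adjacent[c2].append(c1)
    return adjacent

# A strip of three triangles: 0-1 share edge (1, 2), 1-2 share edge (2, 3).
STRIP = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
```

Approach 3 applies the same idea, but only to the cells touched by the update being swept, which is why its cost depends on $M$ rather than on the size of the whole mesh.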



5 Data Structures 



Encoding an SMC introduces some overhead with respect to maintaining just 
the mesh at the highest possible resolution that can be extracted from it. This is 
indeed the cost of the mechanism for handling multiresolution. However, we can 
trade-off between the space requirements of a data structure and the performance 
of the query algorithms that work on it. 

From the discussion of previous sections, it follows that basic requirements 
for a data structure encoding an SMC are to support selective refinement (as 
outlined in Section 3), and to support the extraction of application-dependent
attributes related to vertices and cells. Moreover, a data structure should support 
the efficient reconstruction of spatial relationships of an output mesh, for those 
applications that require it. 

In the following subsections, we describe and compare some alternative data
structures that have different costs and performance. Those described in
Sections 5.1 and 5.2 can be used for any SMC, while those described in Section 5.3
can be used only for a class of SMCs built through specific update operations.
5.1 Explicit Data Structures

An explicit data structure directly represents the structure of the DAG describing 
an SMC. It is characterized by the following information: 

— For each vertex, its coordinates. 

— For each cell: the Cell-Vertex and the Cell-Arc relations, plus possibly a
subset of Cell-Cell relations (as described below). 

— For each node: the Node-Arc relation. 

— For each arc: the Arc-Node relation, and the Arc-Cell relation. 

Depending on the specific application, additional information may be at- 
tached to vertices and/or cells (e.g., approximation errors for simplices). Here, 
we do not take into account such extra information. 
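
One possible in-memory layout for the zero-adjacency variant of this structure is sketched below with Python dataclasses. Entities are referred to by their index in a list; the class and field names (`Arc`, `Node`, `Cell`, `ExplicitSMC`, `add_arc`) are assumptions of this sketch rather than names used in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    source: int                                   # Arc-Node: source node index
    dest: int                                     # Arc-Node: destination node index
    cells: list = field(default_factory=list)     # Arc-Cell: labeling cells

@dataclass
class Node:
    incoming: list = field(default_factory=list)  # Node-Arc: incoming arcs
    outgoing: list = field(default_factory=list)  # Node-Arc: outgoing arcs

@dataclass
class Cell:
    vertices: tuple                               # Cell-Vertex: k + 1 vertex indices
    arc: int = -1                                 # Cell-Arc: arc labeled by this cell

@dataclass
class ExplicitSMC:
    coords: list = field(default_factory=list)    # d coordinates per vertex
    cells: list = field(default_factory=list)
    nodes: list = field(default_factory=list)
    arcs: list = field(default_factory=list)

    def add_arc(self, source, dest, cell_ids):
        """Create an arc and keep the NA, AN, AC and CA relations consistent."""
        arc_id = len(self.arcs)
        self.arcs.append(Arc(source, dest, list(cell_ids)))
        self.nodes[source].outgoing.append(arc_id)
        self.nodes[dest].incoming.append(arc_id)
        for c in cell_ids:
            self.cells[c].arc = arc_id
        return arc_id

# A single triangle created by node 0 and removed by node 1.
smc = ExplicitSMC(coords=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
                  cells=[Cell((0, 1, 2))],
                  nodes=[Node(), Node()])
arc_id = smc.add_arc(0, 1, [0])
```

Storing each relation in both directions costs space but lets a traversal move between nodes, arcs and cells in constant time per step.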

Assuming that any piece of information takes one unit, the space required
by this data structure, except for adjacency relations and attributes, is equal to
$dv + (k + 3)s + 4a$, where $v$, $s$ and $a$ denote the number of vertices, cells, and arcs in
the SMC, respectively. Note that $4a$ is the cost of storing the NA plus the AN
relations (i.e., the DAG structure), while $2s$ is the cost of storing the CA and
AC relations, i.e., information connecting the DAG and the cells of the SMC;
the remaining $dv + (k + 1)s$ accounts for the vertex coordinates and the CV relation.

We consider three variants of adjacency information that can be stored for
each cell $\sigma$:

— Full-adjacency: all four adjacency relations are maintained: co-CC$_1$, co-CC$_2$,
counter-CC$_{1,2}$ and counter-CC$_{2,1}$. For each $(k - 1)$-facet of $\sigma$, co-CC$_1$ and
counter-CC$_{1,2}$ are stored in the same physical link, since they cannot be
both defined; similarly, co-CC$_2$ and counter-CC$_{2,1}$ are stored in the same
physical link. Thus, we have $2(k + 1)$ links for each simplex.

— Half-adjacency: only relations co-CC$_2$ and counter-CC$_{2,1}$ are stored, by using
the same physical link for each $(k - 1)$-facet of $\sigma$, thus requiring $(k + 1)$ links.

— Zero-adjacency: no adjacency relation is stored.

The version with full-adjacency can support incremental adjacency update
(see approach 2 in Section 4). The version with half-adjacency can support
incremental adjacency update only when advancing the front after a node. With
zero-adjacency, adjacency reconstruction must be performed, either as a
post-processing step (approach 1), or incrementally (approach 3).

Figure 5 compares the performance of query algorithms on the zero-adjacency
data structure without adjacency generation, and with adjacencies reconstructed
through approaches 1 and 3. Adjacency reconstruction increases query times by
almost a factor of ten. Therefore, it seems desirable that adjacency information
be maintained in the SMC data structure whenever it is needed in the
output mesh, provided that the additional storage cost can be sustained.

Fig. 5. Query times with adjacency reconstruction on a two-dimensional SMC
using approaches 1 and 3; the dotted curve represents query time without
adjacency generation. The horizontal axis reports the number of triangles in the
output mesh, the vertical axis execution times (in seconds).

5.2 A Data Structure Based on Adjacency Relations

In this section, we describe a data structure that represents the partial order
which defines the SMC implicitly, i.e., without encoding the DAG, but only
using adjacency relations. The data structure stores just vertices and cells, and
it maintains the following information:

— For each vertex: its coordinates. 

— For each cell: the Cell-Vertex relation, and the four Cell-Cell relations (using
$2(k + 1)$ links as explained in Section 5.1).

The space required is $dv + 3(k + 1)s$.

Given a cell $\sigma$, removed by an update $U$, all other cells removed by $U$ are
found through co-CC$_1$ starting at $\sigma$. Among such cells, a cell $\sigma'$ is found which
has at least one counter-CC$_{1,2}$ simplex $\sigma''$. Finally, starting from $\sigma''$ (which is a
cell created by $U$), all remaining cells created by $U$ are found by using relation
co-CC$_2$. These properties allow us to update the current mesh after any movement
of the current front.
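
The traversal just described can be sketched as two flood fills joined by one counter link. The sketch assumes that the cells removed by an update are connected to each other through co-CC links, and likewise the cells it creates; the link tables and the names `flood` and `cells_of_update` are hypothetical and introduced only for illustration.

```python
def flood(start, links):
    """All cells reachable from `start` by following `links` (cell -> cells)."""
    seen, stack = {start}, [start]
    while stack:
        c = stack.pop()
        for n in links.get(c, ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

def cells_of_update(start, removed_links, counter_links, created_links):
    """Recover the cells removed and created by the update that removes `start`.

    removed_links: links between cells removed by the same update
    counter_links: links from a removed cell to a cell created by that update
    created_links: links between cells created by the same update
    """
    removed = flood(start, removed_links)
    seed = next((n for c in removed for n in counter_links.get(c, ())), None)
    created = flood(seed, created_links) if seed is not None else set()
    return removed, created

# Hypothetical update removing t6, t8 and creating t10, t11, t12.
REMOVED_LINKS = {"t6": ["t8"], "t8": ["t6"]}
COUNTER_LINKS = {"t6": ["t10"], "t8": ["t12"]}
CREATED_LINKS = {"t10": ["t11"], "t11": ["t10", "t12"], "t12": ["t11"]}
```

No node or arc is ever consulted: the update is reconstructed from cell-to-cell links alone, which is what allows this structure to omit the DAG.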

The size of the adjacency-based data structure is always larger than that of
the explicit structure with zero-adjacency, while it is comparable with that of
the explicit data structure encoding some adjacency. Note that the space needed
to store adjacency relations tends to explode when the dimension $k$ of the model
increases. Implementing query algorithms on the adjacency-based structure is
more involved than on the explicit structure, and it requires maintaining larger
temporary structures for encoding the internal state (see [H] for details).
Therefore, this data structure should be preferred over the explicit ones only if
adjacency relations are fundamental in the output structure, and the dimension of
the problem is low. However, the performance of the query algorithms is likely
to degrade with respect to the case of explicit data structures. Storage costs for
$k = 2, 3$ are compared in Table 1.
Table 1. Space requirements of the explicit and the adjacency-based data
structures for k = 2, 3.

structure             space (k = d = 2)    space (k = d = 3)
explicit (zero-adj)   2v + 5s + 4a         3v + 6s + 4a
explicit (half-adj)   2v + 8s + 4a         3v + 10s + 4a
explicit (full-adj)   2v + 11s + 4a        3v + 14s + 4a
adj-based             2v + 9s              3v + 12s
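
The entries of Table 1 follow from the general cost formulas of Sections 5.1 and 5.2. The short sketch below encodes those formulas so that the coefficients can be checked for any $k$ and $d$; the function names `explicit_cost` and `adjacency_based_cost` are assumptions of this sketch, and unit counts are used only to read off the coefficients.

```python
def explicit_cost(d, k, v, s, a, adjacency="zero"):
    """Space of the explicit structure: dv + (k + 3)s + 4a plus adjacency links."""
    links_per_cell = {"zero": 0, "half": k + 1, "full": 2 * (k + 1)}[adjacency]
    return d * v + (k + 3 + links_per_cell) * s + 4 * a

def adjacency_based_cost(d, k, v, s):
    """Space of the adjacency-based structure: dv + 3(k + 1)s."""
    return d * v + 3 * (k + 1) * s
```

With `v = s = a = 1` each call returns the sum of the coefficients in the corresponding table cell; for example, `explicit_cost(2, 2, 1, 1, 1, "full")` corresponds to 2v + 11s + 4a.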



5.3 Compressed Data Structures 

Much of the cost of data structures presented in the previous sections is due to
the explicit representation of the cells and of the cell-oriented relations in the
SMC. Indeed, the total number of cells is usually much larger than the number
of vertices, arcs, and nodes involved in the model, and relations among cells
and vertices are expensive to maintain: for instance, a cell needs $k + 1$ vertex
references for representing relation CV.

In some cases, the structure of every update exhibits a specific pattern, which
allows us to compress information by representing cells implicitly. Examples of
update patterns commonly used in building LOD models for surfaces are: vertex
insertion, which is performed by inserting a new vertex and retriangulating
its surrounding polytope accordingly; and vertex split, which is performed by
expanding a vertex into an edge and warping its surrounding cells accordingly.
Such update patterns are well defined in any dimension $d$, and they are depicted
in Figure 6 for the two-dimensional case.




Fig. 6. Two types of update patterns (vertex insertion and vertex split) that allow
the design of compressed data structures for an SMC. The shaded triangles are
those involved in the update.



Since each update U exhibits a predefined pattern, the set of cells it in- 
troduces can be encoded by storing just a few parameters within U, that are 
sufficient to describe how the current complex must be modified when sweeping 
the front, either forward or backward, through U . The type and number of pa- 
rameters depend on the specific type of update for which a certain compressed 
data structure is designed. 
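
For vertex insertion in two dimensions, the idea can be sketched as follows: once the triangles removed by the update are known from the current mesh, the new vertex is the only parameter needed to regenerate the created triangles, by joining it to every boundary edge of the removed region. The sketch assumes that the removed region is star-shaped with respect to the new vertex and ignores triangle orientation; `insert_vertex` is a name chosen here.

```python
from collections import Counter

def insert_vertex(removed, new_vertex):
    """removed: triangles (tuples of 3 vertex ids) replaced by the update.
    Returns the triangles created by joining new_vertex to the boundary."""
    edge_count = Counter()
    for a, b, c in removed:
        for edge in ((a, b), (b, c), (c, a)):
            edge_count[frozenset(edge)] += 1
    # Boundary edges belong to exactly one removed triangle.
    boundary = [e for e, n in edge_count.items() if n == 1]
    return [tuple(sorted(e)) + (new_vertex,) for e in boundary]

# Two triangles sharing edge (0, 2); vertex 4 is inserted inside the quad.
created = insert_vertex([(0, 1, 2), (0, 2, 3)], new_vertex=4)
```

The four created triangles are never stored: they are recomputed whenever the front is swept forward through the node, which is where the space saving of the compressed scheme comes from.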
The generic scheme for a compressed data structure encodes just the vertices,
nodes, and arcs of an SMC. For vertices, it stores the same information
as the standard explicit structure; for nodes and arcs it encodes the following
information:

— For each node $U$: the Node-Arc relation, plus an implicit description of the
cells defining the update described by $U$, i.e., an implicit encoding of the
combination of Node-Cell and Cell-Vertex relations.

— For each arc: the Arc-Node relation. 

The space required (except for the implicit encoding of the NC relation com- 
bined with the CV one) is equal to dv + 4a. 

Note that, since cells do not exist as individual entities, attribute informa- 
tion for them cannot be encoded explicitly. This means that, while the exact 
geometry of cells can be obtained through suitable mechanisms associated with 
update parameters, their attributes can only be approximated through informa- 
tion associated with nodes. In other words, attributes on a node U summarize 
attributes of the whole group of cells associated with it. In this sense, such a 
structure is lossy, because it cannot discriminate between attributes of different 
cells in the context of the same node. Since the evaluation of the resolution filter
may depend on cell attributes, because of the approximation a given cell $\sigma$ may
be judged infeasible even though it is feasible, or vice versa. This may cause the
extraction of a mesh that is either over- or under-refined with respect to the
input requirements.

Another subtle issue, which affects the performance of the extraction algo- 
rithms, is the lack of information on the update that must be applied to refine 
the mesh at a given cell. This is due to the fact that cells are associated with 
nodes, rather than with arcs. When sweeping the current front after a node U,
the state of the current mesh and the update information stored in U allow us
to determine which cells are removed and which cells are created by U. All new
cells are tagged with U as their creator. Let σ be a cell introduced by U. If σ is
not feasible, then we should advance the front after the node U' that removes σ.
Unfortunately, the data structure does not provide information on which child of
U removes σ. In order to avoid cumbersome geometric tests to find U', we adopt
a conservative approach that advances the front after all children of U. However,
such an approach may over-refine the extracted mesh with respect to the
output of the same query answered on a general data structure. Similar problems
arise when sweeping the current front before a node. See El for further details. 

In the following, we describe in more detail two specific data structures for 
the case of vertex insertion. The first data structure applies to SMCs in arbitrary 
dimension d, while the second structure is specific for two-dimensional SMCs. 
Similar data structures for the case of vertex split can also be obtained, by 
combining the ideas presented here with the mechanism described in jSj for a 
single update. 
A Structure for Delaunay SMCs. We present a compressed structure designed
for d-dimensional SMCs, such as the ones used to represent the
domain of scalar fields (e.g., for d = 2, the domain of terrains, or of parametric
surfaces), based on Delaunay simplicial complexes. A simplicial complex is called
a Delaunay simplicial complex if the circumsphere of any of its cells does not
contain vertices in its interior. In two dimensions, Delaunay triangulations are
widely used in terrain modeling because of the regular shape of their triangles
and because efficient algorithms are available to compute them.
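
The empty-circumsphere condition can be illustrated for d = 2 with a minimal Python sketch. The function names (`orient`, `in_circumcircle`, `is_delaunay`) are illustrative assumptions and do not come from the paper; the test is the standard in-circle determinant, evaluated with exact rational arithmetic so that rounding does not affect the sign.

```python
from fractions import Fraction

def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle abc."""
    rows = []
    for q in (a, b, c):
        dx = Fraction(q[0]) - Fraction(p[0])
        dy = Fraction(q[1]) - Fraction(p[1])
        rows.append((dx, dy, dx * dx + dy * dy))
    (ax, ay, aw), (bx, by, bw), (cx, cy, cw) = rows
    # 3x3 in-circle determinant, expanded along the first row.
    det = (ax * (by * cw - bw * cy)
           - ay * (bx * cw - bw * cx)
           + aw * (bx * cy - by * cx))
    # The sign convention flips with the orientation of abc.
    return det * orient(a, b, c) > 0

def is_delaunay(points, triangles):
    """True if no point lies strictly inside any triangle's circumcircle.

    `triangles` is a list of index triples into `points`."""
    return not any(in_circumcircle(points[i], points[j], points[k], p)
                   for (i, j, k) in triangles for p in points)
```

For the four points (0,0), (2,0), (1,2), (1,-1), the triangulation that uses the edge between the first two points passes this check, while the one that uses the other diagonal does not.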

In a Delaunay SMC, the initial simplicial complex is a Delaunay one, and ev- 
ery other node U represents the insertion of a new vertex in a Delaunay complex. 
Thus, every extracted complex is a Delaunay complex. 

For a set of points in general position (no d + 2 points are co-spherical), the
Delaunay complex is unique; thus, a Delaunay complex is completely determined
by the set of its vertices, and the update due to the insertion of a new vertex is
completely determined by the vertex being inserted. The data structure encodes
cells in the following way:

- at the root, the initial simplicial complex is encoded in the winged data
structure;

- for any other node U, the new vertex inserted by U is encoded (this defines an
implicit description of the combination of the Node-Cell and the Cell-Vertex
relations).

The cost of storing the implicit description is just equal to v. It can be 
reduced to zero by storing vertices directly inside nodes. The total cost of this 
data structure is equal to dv + 4a, by considering the space required by the two 
DAG relations (i.e., NA and AN). 

Given a front on the SMC, the vertex stored in a node U is sufficient to 
determine how the corresponding mesh must be updated when sweeping the 
front through U , either forward or backward. This operation reduces to vertex 
insertion or deletion in a Delaunay simplicial complex. 

This compression scheme is easily implemented for 2-dimensional SMCs based 
on vertex insertions in a Delaunay triangulation. For higher values of the dimen- 
sion d, the algorithms necessary to update the current Delaunay complex become 
more difficult |0j . Deleting a point from a Delaunay simplicial complex in three 
or higher dimensions, as required when sweeping the front backward, is not easy;
we are not aware of any implemented algorithm for this task, even in
the three-dimensional case.



A Structure Based on Edge Flips This compression scheme can encode 
any two-dimensional SMC where nodes represent vertex insertions in a triangle 
mesh. It is efficient for SMCs where the number of triangles created by each 
update is bounded by a small constant b. The basic idea is that, for each node 
U, the corresponding update (which transforms a triangle mesh not containing a
vertex p into one containing p) can be performed by first inserting p in a greedy
way and then performing a sequence of edge flips. This process, illustrated in
Figure 7, defines an operational and implicit way of encoding the combination
of the Node-Cell and Cell-Vertex relations.




Fig. 7. Performing an update (insertion of a vertex p) through triangle split
and edge flips. At each flip, the pair σ1, σ2 of triangles sharing the flipped edge
is indicated.



First, p is inserted by splitting one of the triangles removed by p, which we call
the reference triangle for p. This creates three triangles incident at p. Starting
from this initial configuration, a sequence of edge flips is performed. Each edge
flip deletes a triangle σ1 incident at p and the triangle σ2 adjacent to σ1 along
the edge opposite to p, and replaces them with two new triangles incident at p,
by flipping the edge common to σ1 and σ2. At the end, p has a fan of incident
triangles which are exactly those introduced by update U.
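
The split-then-flip procedure can be sketched in Python on a purely topological mesh. This is an illustration under stated assumptions, not the authors' implementation: triangles are unordered vertex triples, the fan around p is tracked as a cyclic list of neighbours (`link`) that starts at the first vertex of the reference triangle, and every flipped edge is assumed to be interior to the mesh. The name `insert_vertex` is hypothetical.

```python
def insert_vertex(mesh, p, ref, flips):
    """Insert vertex p by splitting triangle `ref`, then applying edge flips.

    mesh  : set of frozenset({a, b, c}) triangles
    ref   : (a, b, c), the reference triangle in radial order
    flips : list of fan indices; index j flips the edge opposite p
            in the j-th triangle of the current fan around p
    Returns the new mesh and the cyclic list of neighbours of p."""
    mesh = set(mesh)                      # work on a copy
    mesh.remove(frozenset(ref))
    link = list(ref)                      # neighbours of p, in radial order
    for i in range(3):                    # triangle split: three new triangles
        mesh.add(frozenset((p, link[i], link[(i + 1) % 3])))
    for j in flips:
        n = len(link)
        u, v = link[j], link[(j + 1) % n]
        s1 = frozenset((p, u, v))         # triangle incident at p
        # s2 is the other triangle sharing edge (u, v); assumed to exist.
        (s2,) = [t for t in mesh if u in t and v in t and t != s1]
        (q,) = s2 - {u, v}                # vertex of s2 opposite the edge
        mesh -= {s1, s2}
        mesh |= {frozenset((p, u, q)), frozenset((p, q, v))}
        link.insert(j + 1, q)             # the fan grows by one triangle
    return mesh, link
```

Each flip lengthens `link` by one entry, which matches the observation that the fan around p gains one triangle per flip.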

The update represented by a node U is fully described by the new vertex p,
a reference triangle σ, and a sequence of edge flips. Edge flips are represented by
numerical codes. Let us consider an intermediate situation where a flip replaces
the j-th incident triangle of p in a radial order around p (e.g., in counterclockwise
order starting from the topmost triangle): then we use the number j to code the
flip. The code of the first flip is in the range 0 ... 2 since, at the beginning, there
are only three triangles incident at p. Since each edge flip increases the number
of these triangles by one unit, the code for the j-th flip is in 0 ... j + 1. The total
number of edge flips for a vertex p is t - 3, where t is the number of triangles
created by the update U. Since t ≤ b, the flip sequence consists of at most b - 3
integers, where the j-th integer is in the range 0 ... j + 1. Therefore, the sequence
of flips can be packed in a flip code of

$$\sum_{j=1}^{b-3} \log_2(j+2) = \sum_{i=3}^{b-1} \log_2(i) = \log_2((b-1)!) - 1$$

bits.
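
One way to realize such a packing is a mixed-radix integer in which the j-th flip uses radix j + 2. The following Python sketch is an assumption about how this could be done, not the paper's encoding; the function names are hypothetical, and the decoder assumes the number of flips is known from elsewhere.

```python
import math

def flip_code_bits(b):
    """Size bound log2((b-1)!) - 1, in bits, for at most b - 3 flips."""
    return math.log2(math.factorial(b - 1)) - 1

def pack_flips(flips):
    """Pack a flip sequence into one integer (mixed radix 3, 4, 5, ...)."""
    code = 0
    for j, f in enumerate(flips, start=1):
        radix = j + 2                     # the j-th flip code lies in 0 .. j+1
        if not 0 <= f < radix:
            raise ValueError(f"flip {j} has code {f}, expected 0..{radix - 1}")
        code = code * radix + f
    return code

def unpack_flips(code, count):
    """Recover `count` flip codes from a packed integer."""
    flips = []
    for j in range(count, 0, -1):         # peel off the last flip first
        code, f = divmod(code, j + 2)
        flips.append(f)
    return flips[::-1]
```

For b = 6 there are at most three flips with radices 3, 4 and 5, so 60 distinct sequences, and `flip_code_bits(6)` equals log2(60), in agreement with the formula above.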

The reference triangle σ for p is a triangle created by some update U' that
is a parent of U in the DAG. Thus, to uniquely define σ, it is sufficient to give a
reference to U' and an integer number identifying one of the triangles incident
in the central vertex of U' according to a radial order (e.g., counterclockwise).
Conventionally, we organize the data structure in such a way that parent U' is
the first parent stored for U; thus, there is no need to encode it explicitly. The
number identifying σ is in the range 0 ... b - 1. We can pack the flip code and
the index of σ together in $\log_2(b!) - 1$ bits. The space required for the implicit
encoding of cells in this scheme is $v(\log_2(b!) - 1)$ bits.

Extending this compression scheme to higher dimensions is a non-trivial task.
Edelsbrunner and Shah jS| showed that insertion of a point in a Delaunay
simplicial complex reduces to a sequence of flips of (k - 1)-facets. This result could
suggest that a coding based on flips may be possible for k-dimensional Delaunay
SMCs built through incremental refinement. However, in k dimensions it is not
clear how the (k - 1)-facets incident at a vertex could be sorted in such a way as to
allow the definition of a compact flip code. Moreover, it is difficult to guarantee
a bounded degree of vertices in a k-dimensional simplicial mesh and, in any case,
the number of flips is not guaranteed to be linear in the degree of the inserted
vertex.



Discussion We have compared the sizes of the two compressed structures out- 
lined above with the size of the explicit data structure, for a number of two- 
dimensional SMCs. On the average, the space occupied by the Delaunay com- 
pressed structure, and by the one based on edge flips, is about 1/4 and 1/3, 
respectively, of the space needed by the explicit structure without adjacencies. 

It is interesting to compare the performance of query algorithms on an SMC 
when it is encoded through an explicit data structure, or through a compressed 
one, and the quality of the triangle meshes produced by such algorithms. Query 
algorithms provide the same meshes for the same input parameters with all 
compressed data structures, but the performances vary depending on the amount 
of work needed for reconstructing triangles with the specific structure. 

Our experiments have shown that, if the given resolution filter does not refer
to triangle attributes (e.g., it depends just on the geometry and location of
triangles in space), the meshes extracted by a query algorithm using a compressed
or an explicit structure are the same. On the contrary, if the resolution filter
uses triangle attributes, then the resulting mesh may be quite different due to
the approximation of such attributes in the compressed structures.

We have experimented with resolution filters that refer to approximation
errors associated with triangles of an SMC. The resolution filter imposes an
upper bound on the error of triangles that can be accepted in the solution of a
query. In the compressed structure, a single error is associated with each node
U, defined as the maximum approximation error of the triangles created by U.
When such triangles are reconstructed in the current mesh, they receive the
approximation error of U, which over-estimates their true error, hence forcing
the extraction algorithm to over-refine the solution of the query. In this case,
meshes extracted from the compressed SMC may be twice as large as those
obtained from the explicit structure.
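
The over-refinement effect can be seen in a toy Python sketch. The function names and the error values are invented for illustration; the only point is that replacing per-triangle errors by the node maximum can reject triangles that an explicit structure would accept.

```python
def explicit_refine(tri_errors, threshold):
    """Per-triangle decision: refine only triangles whose own error is too large."""
    return [e > threshold for e in tri_errors]

def compressed_refine(tri_errors, threshold):
    """Node-level decision: every triangle inherits the node's maximum error."""
    node_error = max(tri_errors)
    return [node_error > threshold] * len(tri_errors)

# Hypothetical errors of three triangles created by the same node U.
errors = [0.2, 0.5, 1.4]
print(explicit_refine(errors, 1.0))    # only the third triangle is refined
print(compressed_refine(errors, 1.0))  # all three are refined
```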

The performance of query algorithms has been monitored only for the explicit
structure and for the Delaunay-based compressed structure. The compressed
structure based on edge flips is still under implementation. The explicit structure
supports the extraction of triangle meshes formed by about 10^ cells from SMCs 
containing about 10® cells, in real-time. Query algorithms on the Delaunay-based 
compressed structure are much slower. The increase in execution times is due to 
the on-line computation of a Delaunay triangulation. We expect better results 
with the structure based on edge flips. 
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6 Concluding Remarks 

We have presented several alternative data structures for encoding a Simplicial 
Multi-Complex. 

General-purpose data structures are characterized by encoding different sub- 
sets of the basic relations between the elements of an SMC. Different alternatives 
can be selected in order to adapt to the needs of a specific task, and to trade-off 
between space and performance. The SMC has been extended to cell complexes 
in HU. However, the main difficulty in extending general-purpose data struc- 
tures to general cell complexes lies in the intrinsic complexity of data structures 
for cell complexes, compared with those for simplicial complexes. 

Compressed data structures have been defined for SMCs in the two-dimensional
case, in which only the DAG structure is stored, and triangles are encoded
through an implicit rule. Only the structure for Delaunay SMCs extends easily to
three or more dimensions, even though the problem of deleting a point from a
Delaunay mesh in three or higher dimensions is solved only from a theoretical point
of view. We plan to investigate more general compressed structures for
higher-dimensional SMCs in the future.

Compressed data structures can be much more compact than general-purpose 
ones, but, on the other hand, the performance of extraction algorithms can 
be degraded severely, because of additional work necessary to reconstruct the 
structure of meshes. In the two-dimensional case, it should be remarked that, 
while the Delaunay-based data structure is more compact than the one based 
on edge flips, the performance of the extraction algorithms is severely affected 
by numerical computation necessary to update the Delaunay triangulation. 

Based on the data structures presented here, we have developed an object- 
oriented library for building, manipulating and querying two-dimensional SMCs, 
which has been designed as an open-ended tool for developing applications that 
require advanced LOD features HU. In the current state of development, the li- 
brary implements both the explicit and the Delaunay-based data structures, the 
algorithms described in m, algorithms for building an SMC both for terrains 
and free-form surfaces, and application-dependent operations implemented on top
of the query operations, mainly for GIS applications (interactive terrain
visualization, contour map extraction, visibility computations, etc.). We are currently
implementing the structure based on edge flips and a version of the library for 
dealing with three-dimensional SMCs for representing 3D scalar fields at variable 
resolution. 

An important issue in any application which deals with large data sets is 
designing effective strategies to use secondary storage. To this aim, we have 
been studying data structures for handling SMCs on secondary storage. In nn, 
a disk-based data structure for two-dimensional SMCs is proposed in the context 
of a terrain modeling application. Such a structure organizes a set of SMCs, 
each of which describes a subset of a larger area. Individual SMCs reside on 
separate files, and two or more of them (i.e., the ones contributing to represent a 
relevant area of space) can be merged into a single SMC when loaded into main 
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memory to answer a query. Future work involves defining and implementing 
query algorithms having a direct access to large SMCs resident on disk. 
Abstract. It is frequently the case that spatial queries require a result set of
objects whose scale (however this may be more precisely defined) is the same
as that of the query window. In this paper we present an approach which 
considerably improves query performance in such cases. By adding a scale 
dimension to the schema we make the index structure explicitly “aware” of the 
scale of a spatial object. The additional dimension causes the index structure to 
cluster objects not only by geographic location but also by scale. By matching 
scales of the query window and the objects, the query then automatically 
considers only “relevant” objects. Thus, for example, a query window 
encompassing an entire world map of political boundaries might return only 
national borders. Note that “scale” is not necessarily synonymous with “size”. 
This approach improves performance by both narrowing the initial selection 
criteria and by eliminating the need for subsequent filtering of the query result. 
In our performance measurements on databases with up to 40 million spatial 
objects, the introduction of a scale dimension decreased I/O by up to 4 orders of 
magnitude. The performance gain largely depends on the object scale 
distribution. 

We investigate a broad set of parameters that affect performance and show that 
many typical applications could benefit considerably from this technique. Its 
scalability is demonstrated by showing that the benefit increases with the size of 
the query and/or of the database. The technique is simple to apply and can be 
used with any multidimensional index structure that can index spatial extents 
and can be efficiently generalized to three or more dimensions. In our tests we 
have used the BANG index structure. 



1 Introduction 

Conventionally, an index of spatial objects takes only the physical extent of the 
objects (or their bounding boxes) into account. Queries on such an index can thus 
only constrain the location of the result objects. Any further constraints on the type or 
size of the objects can only be imposed by appending an additional filtering step to 
the result of the spatial query. This procedure is obviously inefficient in two ways: the 
initial spatial query may retrieve large numbers of irrelevant objects, which then have 
to be removed at the second step. As a result, two particular classes of spatial queries 
are still poorly served in practice: 
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1. Window queries which are further constrained to show only features 
commensurate with the size of the query window; for example, a map display 
query should retrieve only those features visible at the display scale. 

2. Window queries further constrained to show only features of a particular range of 
subtypes (“range” implying some linear classification of subtypes); for example, a 
query of political boundaries in a window encompassing the whole world might 
reasonably only return national and state/province boundaries. 

The conventional way of trying to improve the performance of queries of the first 
class is to break up the database into separate subsets of objects of more-or-less 
arbitrarily chosen ranges of sizes. For queries of the second class, the familiar 
“layering” technique of GIS helps to improve performance by correspondingly 
partitioning the database into subsets of a single type. 

But both of these are very inflexible techniques. The first moves away from 
automated indexing towards manual intervention - which is hardly an advance; and 
the second suffers from a general objection to layering: it cannot efficiently resolve all 
the subtleties of subtyping which can be better represented in an object type 
hierarchy. In addition, the need for a multiplicity of additional files and associated 
indexes introduces potential new inefficiencies. 

In this paper we present a more flexible and integrated technique that can 
considerably improve query performance under such circumstances. We achieve this 
by introducing an additional scale dimension to the data space of the entire database 
of spatial objects. This allows the introduction of a new type of query that includes a 
scale constraint on the objects returned in the query window. 
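
The semantics of such a scale-constrained window query can be sketched in Python as a plain linear scan. This is only a specification-level illustration, not the index-based method of the paper; the type `SpatialObject`, the function `window_query`, and the scale values are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple

class SpatialObject(NamedTuple):
    name: str
    box: Tuple[float, float, float, float]   # xmin, ymin, xmax, ymax
    scale: float                             # hypothetical scale value

def window_query(objects: List[SpatialObject],
                 window: Tuple[float, float, float, float],
                 min_scale: Optional[float] = None,
                 max_scale: Optional[float] = None) -> List[str]:
    """Names of objects intersecting `window` whose scale is in range."""
    wx0, wy0, wx1, wy1 = window
    result = []
    for obj in objects:
        x0, y0, x1, y1 = obj.box
        if x1 < wx0 or wx1 < x0 or y1 < wy0 or wy1 < y0:
            continue                         # no spatial overlap
        if min_scale is not None and obj.scale < min_scale:
            continue                         # too small-scale for this query
        if max_scale is not None and obj.scale > max_scale:
            continue                         # too large-scale for this query
        result.append(obj.name)
    return result
```

With a scale dimension in the index, the scale test becomes part of the multidimensional search itself instead of a filter applied after the spatial selection, which is where the reported savings would come from.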

While our approach is applicable to most index structures that can handle 
multidimensional and spatial data we chose the BANG index structure for our 
implementation. The ability of this structure to mix spatial and point dimensions 
reduces the additional space overhead incurred by the scale dimension. 

With large data sets of spatial extents (40 million) our performance tests show a 
reduction in I/O accesses of up to four orders of magnitude. With smaller databases (1 
million spatial extents) the improvement is still up to a factor of nearly 600. CPU-time 
is negligible whether a scale dimension is used or not. As our approach especially 
favors large queries and large databases it is superscalable. 



2 Related Work 

There have been several approaches to separating small and large objects for 
increased performance. One such is the filter tree [SK95], which is something of a 
misnomer, since it is actually a forest of B -trees. This spatial index structure has a 

number of layers with one tree for each. Each layer i is assigned a 2^i × 2^i grid.
Beginning at the top level 0, an inserted spatial object “falls” through the sequence of
increasingly dense grid layers until its bounding box hits one of the grid lines of layer
n. The object is then inserted in layer n-1, in which it is completely
contained by one of the grid squares. Large objects are always in high layers whereas
small objects generally tend to be in lower layers. The filter tree performs well for
spatial joins, but the performance of range queries is reported to be inferior to the
R-tree. Essentially, the filter tree partitions the set of indexed objects according to a
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property partially correlated to size, and a separate index structure is maintained for 
each of the partitions. 
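
The layer assignment described above can be sketched in Python, assuming a unit-square data space, closed bounding boxes, and a hypothetical cap `max_layer` on the number of layers. The function name is illustrative and is not taken from [SK95].

```python
import math

def filter_tree_layer(xmin, ymin, xmax, ymax, max_layer=16):
    """Layer in which a bounding box inside [0, 1) x [0, 1) is stored.

    The box falls through layers 1, 2, ... until it crosses a grid line
    of the 2^n x 2^n grid of layer n; it is then stored in layer n - 1."""
    for n in range(1, max_layer + 1):
        cells = 2 ** n
        crosses_x = math.floor(xmin * cells) != math.floor(xmax * cells)
        crosses_y = math.floor(ymin * cells) != math.floor(ymax * cells)
        if crosses_x or crosses_y:
            return n - 1
    return max_layer

print(filter_tree_layer(0.1, 0.1, 0.2, 0.2))      # 2
print(filter_tree_layer(0.4, 0.4, 0.6, 0.6))      # 0
print(filter_tree_layer(0.26, 0.26, 0.27, 0.27))  # 5
```

The second call shows why the layer is only partially correlated with size: a moderately sized box that straddles the centre of the data space stays in layer 0.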

A possible way of improving the performance of range queries arises from the fact 
that some of the layers tend to contain relatively few objects so that those layers could 
be kept in a cache. The overhead of accessing several separate index structures would 
then be decreased. Unfortunately, this method does not scale with increasing database 
size unless the cache can be increased accordingly. The choice of granularity for the 
smallest grid size is also somewhat arbitrary, and the optimum would depend on the 
distribution of object sizes. For dynamic data this optimum could change over time,
and the introduction or removal of a grid layer would be a multiple update operation.
This makes the design unsuitable for dynamic applications with unpredictable update
characteristics.

There are two other approaches that follow a similar pattern: the layering technique 
in GIS and level-based indexing. In traditional GIS, layering partitions the database in 
separately indexed groups like "countries", "cities", "streets"... Level-based indexing 
[Kan98] in contrast separates objects more or less directly based on their extent. The 
objects are indexed in multiple indexes according to a nesting-based partitioning 
technique. As with the filter tree, this partitioning is also related to the object size, but 
uses a different approach. In tests, performance improvements could be achieved if 
most of the partitions (those with the larger objects) were indexed in main memory or 
cached. This turns out to be quite advantageous for two reasons: large objects tend to 
be inefficiently indexed and can benefit tremendously from being located in main 
memory. Furthermore, large objects tend to be in the minority in geographic 
databases (with the exception of elevation lines) and thus might reasonably fit into 
main memory. In the tests only the partition with the smallest objects (mainly indexed 
as point data) was indexed on disk. If all the indexes were kept on disk the method 
would become inefficient because of access to several separate indexes. This is the 
same problem as that faced by the filter tree and the fundamental reason why its range 
query performance is worse than that of the R-tree. 

A related approach based on a single index is described in [Kan98], where
promotion-based indexing is presented for different index structures such as the
R-tree and R*-tree. This approach allows 'large' spatial objects to be promoted to a
higher index level. This makes it possible to reduce the extent of a page region at the
index level from which the object was promoted. Queries tend to perform better
because the query region then intersects fewer page regions. A performance
improvement of up to 45% is reported in tests with different types of data, although in
some cases the improvement was considerably less. One possible disadvantage is that
the height of the tree may increase as a consequence of the object promotions.
However, in contrast to previous promotion methods this approach can be used with
intersecting page regions as found in the R-tree, the R*-tree and other spatial data
structures. Promotion can be either extent-based or nesting-based.

- It should be noted that these four approaches have a different target compared to
ours: with the exception of the layering technique in GIS they have all been
devised to improve performance while returning all objects geographically in
range. In contrast, our objective is to avoid returning the complete set and to access
only the “interesting” scales. The fundamental question is thus not whether our
approach is better than the filter tree, level-based or promotion-based indexing.
Rather, the question is whether it is better for our specific objective of accessing
only those objects of sufficiently large scale to match the scale of a given query
box. Although it would be a straightforward variation of all these related
techniques to restrict access to those regions of the index structure which contain
large objects, there are unfortunately considerable drawbacks to this approach in all
cases (including the layering technique):

- None of them is based on scale, but rather more or less directly on size. The scale 
and the size of an object need not be related (we will make this point using the 
example of elevation lines). If the four approaches were all changed to use scale 
instead of their original partitioning criterion then the filter tree would become a 
special case of level-based indexing; level-based indexing would become 
fundamentally identical with the layering technique; and promotion-based indexing 
would lose its ability to reduce page region sizes - its fundamental raison d'être.
Level-based indexing could be expected to perform quite well except for the 
overhead inherent in accessing several separate index structures. 

- All four approaches fundamentally partition the objects into a small number of
“buckets” (layer, index-layer and containment level respectively). With a scale
dimension, in contrast, scale is assigned on a near-continuous range, as continuous
as the data type used allows. This allows a more graceful approximation of, for
example, differently scaled printed maps.



3 BANG Indexing 

We emphasize that the approach described here does not depend on any specific 
multidimensional indexing method, except that the method must be able to support an 
additional linear scale dimension in an index of extended spatial objects. For 
completeness, however, we review below the organization of the BANG index 
structure used in our tests. 



3.1 Multidimensional Point Indexing 

A BANG index [F87, F89a] has a balanced tree structure which represents a recursive 
partitioning of an n-dimensional data space into subspaces. Each subspace 
corresponds to a disk page, and the partitioning algorithm attempts to ensure that no
page is ever less than 1/3 full (but see [F95]). In general, like B-tree pages, they will be
approximately 69% full. Tuples are represented by points in the data space, and each
data page contains either full tuples (in a clustered index) or index keys with 
associated pointers to individual tuples stored on a heap (in an unclustered index). 

When a page overflows, it is split into two by generating a sequence of regular 
binary partitions of the corresponding subspace, cycling through the dimensions of 
the data space in a fixed order, until the content of the two resulting subspaces is 
distributed as equally as possible. This partition sequence is represented by a Peano 
code (a variable-length binary string) which uniquely identifies the external boundary 
of each resulting subspace. The set of subspaces represented within each node of the 
index tree thus appears as a set of Peano code index keys. Associated with each key k 
is a pointer to a node at the next lower level of the tree. This node represents the 
subspace whose external boundary is defined by key k. 
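
A minimal Python sketch of how such a key could be generated for a single point is given below. It assumes coordinates normalized to [0, 1) and a fixed number of partition steps; the function name `peano_code` is illustrative, and a real BANG index generates only as many bits as each index level needs.

```python
def peano_code(point, nbits):
    """Bit string locating `point` after `nbits` regular binary partitions,
    cycling through the dimensions in a fixed order."""
    dims = len(point)
    lo = [0.0] * dims
    hi = [1.0] * dims
    bits = []
    for i in range(nbits):
        d = i % dims                      # dimension halved at this step
        mid = (lo[d] + hi[d]) / 2
        if point[d] < mid:
            bits.append("0")              # keep the lower half
            hi[d] = mid
        else:
            bits.append("1")              # keep the upper half
            lo[d] = mid
    return "".join(bits)

print(peano_code((0.75, 0.25), 4))        # 1011
```

Because every additional partition only appends a bit, the code of a subspace is a prefix of the code of every point it encloses.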
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Fig. 1: BANG indexing 

Note however that, although an index node containing a given number of key entries
represents the same number of subspaces, there is not in general a 1:1 correspondence
between entries and subspaces, i.e. some subspaces may be defined by a non-singular
subset of the keys. This is a consequence of a unique feature of BANG indexing,
whereby the partitioning algorithm allows one subspace to enclose another. In general,
therefore, a subspace may have a set of internal boundaries as well as an external
boundary. Thus an individual key k in an index node may both represent the external
boundary of a subspace s and an element of the set of internal boundaries of a
subspace which encloses s.

A further consequence of this enclosure property is that a BANG index is a true prefix 
tree i.e. any leading bits which are common to all the keys within a node can be 
deleted and appended to the key representing the node at the index level above. This 
is because the Peano code of any subspace is a prefix of the Peano code of every 
subspace or tuple which it encloses. The full key representing an individual subspace 
or tuple is thus never stored explicitly in the index. It can however always be found by 
concatenating the keys encountered in the direct descent from the root of the tree to 
that subspace or tuple. 
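
The prefix promotion can be illustrated with a short Python sketch that operates on keys written as bit strings. The function names are hypothetical, and the handling of the promoted prefix in the parent node is omitted.

```python
def common_prefix(keys):
    """Longest bit string that is a prefix of every key in the node."""
    if not keys:
        return ""
    shortest = min(keys, key=len)
    for i, bit in enumerate(shortest):
        if any(k[i] != bit for k in keys):
            return shortest[:i]
    return shortest

def promote_prefix(keys):
    """Split node keys into the shared prefix (to be appended to the parent
    entry) and the shortened keys that remain in the node."""
    prefix = common_prefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

prefix, rest = promote_prefix(["10110", "1011", "10100"])
print(prefix, rest)                       # 101 ['10', '1', '00']
```

Concatenating the promoted prefix with a shortened key gives back the original key, which mirrors how a full key is recovered on the descent from the root.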

An additional feature is that it is rarely necessary to employ all the bits of the n 
index attributes of a tuple to generate its index key. During an exact-match search, the 
index key of the query tuple is generated dynamically and incrementally as the search 
proceeds down the tree. At each index level, only as many key bits are generated as 
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are necessary to distinguish between the subspaces or tuples at that level. Thus, if a 
set of tuples has widely differing values, only very short keys are needed to 
differentiate them. If, on the other hand, the tuples all have closely similar values, 
then their index keys will share many common prefix bits which will be promoted to 
higher index levels. 

An example of a simple two-level partitioning of a data space, and the BANG 
index structure corresponding to it, is given in figure 1. Note that, because index keys 
are of variable length, each index entry includes an integer giving the length of its 
Peano code. Entries are stored in prefix order: if key a is a prefix of key b, then a 
precedes b. 

Exact-match tuple search proceeds downwards from the root in the familiar B-tree 
manner. But within an index node, a bit-wise binary search is made for the key or 
keys which match the leading bits of the search key. The fact that there may be more 
than one such matching key is a consequence of the fact that one subspace can 
enclose another. This potential location ambiguity is, however, simply resolved by 
always choosing the longest matching key, since this represents the innermost of any 
nested subspaces. 
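
The longest-match rule can be sketched in Python as follows. For brevity the sketch scans the node linearly, whereas the text describes a bit-wise binary search; the function name and the example keys are assumptions made for illustration.

```python
def find_subspace(node_keys, search_key):
    """Key of the innermost subspace in this node that contains `search_key`.

    Returns None if no key matches, i.e. the point lies in a part of the
    data space that is not represented in the index."""
    matches = [k for k in node_keys if search_key.startswith(k)]
    return max(matches, key=len) if matches else None

keys = ["0", "01", "0110", "11"]
print(find_subspace(keys, "011011"))      # 0110 (innermost of three matches)
print(find_subspace(keys, "0100"))        # 01
print(find_subspace(keys, "1000"))        # None
```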
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Fig. 2: Object cover indexing 
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Figure 1 also demonstrates a critically important feature of domain indexing: it is not 
in general necessary to represent the whole data space in the index, if large tracts of 
the space are empty. In the example of figure 1, some exact-match queries will fail at 
the root index level. In general, therefore, the average path length traversed to answer 
an exact-match query will be less than the height of the index tree. The more highly 
correlated the data, the greater the advantage of this feature compared to the 
conventional range-based indexing of the B-tree. This advantage is reinforced by the 
extremely compact index due to the (on average) very short index entries. This leads 
to high fan-out ratios and hence a reduced index tree height and consequent faster 
access for a given data set. 



3.2 Object Cover Indexing 

A method of extending BANG multidimensional point indexing to spatial object 
covers (i.e. rectangular bounding boxes) was originally proposed in [F89b]. The 
method used here is a variation on that design, which involves a dual representation. 
Point indexing is used to partition the data space into point regions according to the 
centre points of the object covers. A rectangular cover region is then associated with 
each point region, such that the cover region boundary just encloses the covers of all 
the objects in the point region. (See figure 2). Additional lower and upper bound 
coordinate pairs representing this cover region are thus added to each index entry of 
the point index representation. 

The dual representation is reflected in the queries: containment queries need only
check for overlap with point regions, whereas intersection queries must check for
overlap with the cover regions.
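
A minimal sketch of this dual representation follows. It assumes axis-aligned rectangles with closed boundaries and a flat list of index entries; the class and function names are hypothetical and do not reproduce the BANG implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    xlo: float
    ylo: float
    xhi: float
    yhi: float

    def overlaps(self, other):
        """True if the two closed rectangles share at least one point."""
        return (self.xlo <= other.xhi and other.xlo <= self.xhi and
                self.ylo <= other.yhi and other.ylo <= self.yhi)


@dataclass(frozen=True)
class IndexEntry:
    point_region: Rect  # partition containing the centre points
    cover_region: Rect  # tight box around the covers of all objects in it


def cover_of(object_covers):
    """Smallest rectangle enclosing all given object covers."""
    return Rect(min(c.xlo for c in object_covers), min(c.ylo for c in object_covers),
                max(c.xhi for c in object_covers), max(c.yhi for c in object_covers))


def candidates(entries, window, query_type):
    """Select the entries that a query has to descend into."""
    if query_type == "containment":
        return [e for e in entries if e.point_region.overlaps(window)]
    if query_type == "intersection":
        return [e for e in entries if e.cover_region.overlaps(window)]
    raise ValueError(query_type)


# Two object covers whose centres (1, 4) and (10, 5) lie in the point region.
entry = IndexEntry(Rect(0, 0, 10, 10),
                   cover_of([Rect(-2, 3, 4, 5), Rect(8, -1, 12, 11)]))
window = Rect(10.5, 0, 11.5, 5)
print(len(candidates([entry], window, "containment")))   # 0
print(len(candidates([entry], window, "intersection")))  # 1
```

An object fully contained in the window must have its centre inside the window, which is why the containment query can prune on the point region alone.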



4 Introducing a Scale Dimension 

We propose the introduction of an additional object attribute, which, according to 
some measure we will define, represents the scale of the indexed spatial object. 

We define the scale S(O) of an object O as a perceived value of importance or
size. It can best be understood by considering maps with different scales: a
larger-scale map contains larger-scale objects. In the case of a map there actually
is a well-known measure of scale: a length l on the map corresponds to a multiple m
of l in reality. The larger the value of m, the larger the scale of the map. Given a
constant physical size of the map, this results in a larger geographic area covered.*

We define: 

- S(M) as the scale of a map M. S(M) can be easily measured and is typically
given as 1:m.

- S(Q) as the scale of a query Q. S(Q) is loosely defined based on the scale of a
map of the same geographic region.



* The geographic area covered by a printed map depends on the scale and the physical size of 
the map. For simplicity we assume the physical size of the map constant. This assumption 
does not reduce generality but simplifies the discussion. 
- S(O) as the scale of an object O. This function is based on the perception of which
objects should be represented on a map of a given scale. Given a map M_1 with
scale S(M_1), we assign the scale S(O_1) = S(M_1) to an object O_1 that is perceived
as just large-scale enough to be on the map M_1.

For most objects we assume an automatic assignment of scale that sufficiently 
approximates human perception. For a large class of objects on a map the assignment 
of scale based on geographic extent is reasonable. Assuming automatic scale 
assignment, the value for this attribute does not depend on any information that is not 
already inherent in the original data. The explicit storage of this attribute is thus 
redundant. The important purpose of the scale value is that it causes the index 
structure not only to cluster objects geographically, but also by scale. Objects of 
“similar” scale such as the countries of Germany and Japan would be likely to be 
clustered close to each other, despite their geographic distance. 

A query Q_1 with scale S(Q_1) could (transparently to the user) specify a range of
desired object scales [S(Q_1), max_scale]. In this query all objects O of scale
S(O) < S(Q_1) are ignored.
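
Seen this way, a scale-restricted query is simply a range query with one more dimension. The sketch below is a brute-force filter over a hypothetical list of (name, x, y, scale) tuples, using object centres only; an index with a scale dimension would evaluate the same predicate through its clustering instead of a scan.

```python
def scale_restricted_query(objects, window, query_scale, max_scale):
    """Return names of objects whose centre lies in the window and whose
    scale lies in [query_scale, max_scale].

    objects: iterable of (name, x, y, scale) tuples
    window:  (xlo, ylo, xhi, yhi)
    """
    xlo, ylo, xhi, yhi = window
    return [name for name, x, y, scale in objects
            if xlo <= x <= xhi and ylo <= y <= yhi
            and query_scale <= scale <= max_scale]


# Hypothetical data: scale values are arbitrary integers for illustration.
objects = [("country", 50, 50, 9), ("city", 52, 48, 5), ("village", 51, 51, 2),
           ("far_city", 90, 10, 5)]
print(scale_restricted_query(objects, (40, 40, 60, 60), query_scale=4, max_scale=10))
# ['country', 'city']
```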

Of course the introduction of an additional dimension also has a cost, and certain 
queries might be unfavorably affected by the use of a scale dimension. Specifically, 
one expects this to happen when a query is spatially so restricted that there are no 
objects of scale less than S(Q_1). In this case the clustering according to scale does not
help and only the incurred cost counts. However, such a small query would need very 
few disk accesses, and thus any performance decrease would have very little impact. 
We verify this in the section on “Performance Evaluation”. 

In the next section we turn our attention to the criteria which should be applied 
when using the additional scale dimension. 



5 Assignment of Scale 

5.1 Scale Based on Extent 

Several important decisions will influence the success of using a scale dimension: 

1. How is the scale of a spatial object assigned?

2. If the scale is a function of the size of an object, should it be logarithmic or linear? 

3. If the scale is a function of the size of an object, how much smaller than the query 
can an object be to still be considered large-scale enough? 

4. Which is the domain of sizes or scales we would like to distinguish? Objects 
outside the domain of interest would be assigned the maximum / minimum scale 
value. 

Point 1: How is the scale of a spatial object assigned? Scale should be assigned to 
objects to represent human perception as closely as possible. For simplicity we based 
scale on geographic extent in our performance tests. This coincides with scale 
perception for a large set of objects typically found on maps. 
We decided to assign the scale of an object based on the maximum of its extent in
the x-dimension and the y-dimension. The longer side of a very long and thin object
thus determines its scale. Another option would be to measure the area covered by
the minimal bounding box or by the actual object itself. This alternative would have
two disadvantages, illustrated by the following examples: lines of latitude and
longitude would be assigned a negligibly small scale; and rivers and highways would
be assigned a scale value depending on whether they are horizontally/vertically
aligned (tiny bounding box) or diagonally aligned (big bounding box).

We thus introduce the scale function: 

$$ S_1(O) = \max\bigl(\mathrm{extent}_x(O),\ \mathrm{extent}_y(O)\bigr) $$



Point 2: If the scale is a function of the size of an object, should it be logarithmic or 
linear? We chose to use a logarithmic scale. To understand our reasoning, consider 3 
spatial objects A, B and C. 

$$ S_1(A) = 2\,S_1(B) = 4\,S_1(C) $$

Comparing the relative sizes of A, B and C, A relates to B as B relates to C.
Unfortunately, an index data structure works with a linear scale. On such a linear
scale, France and Germany would be considered further apart in scale than a city from
a village. The user, in contrast, would consider France and Germany very similar in
scale because their relative difference is small.

We thus do not index S_1(O) for an object O, but rather

$$ S_2(O) = k_1 + k_2 \log_b\bigl(S_1(O)\bigr) $$

For simplicity, we assume b = 2. The constants k_1 and k_2 will be considered further below.
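
The effect of the logarithm can be checked with a small sketch. Here the extent-based scale is written S_1 and its logarithmic form S_2; the default values k1 = 0 and k2 = 1 are placeholders chosen only for illustration, and the function names are hypothetical.

```python
import math


def extent_scale(xlo, ylo, xhi, yhi):
    """S_1(O): the larger of the object's extents in x and y."""
    return max(xhi - xlo, yhi - ylo)


def log_scale(s1, k1=0.0, k2=1.0, b=2.0):
    """S_2(O) = k1 + k2 * log_b(S_1(O)); k1 and k2 are placeholder values."""
    return k1 + k2 * math.log(s1) / math.log(b)


# Three objects with S_1(A) = 2 * S_1(B) = 4 * S_1(C).
a = extent_scale(0, 0, 4, 1)
b = extent_scale(0, 0, 2, 2)
c = extent_scale(0, 0, 1, 0.5)

print(a - b, b - c)  # unequal steps on the linear scale
print(log_scale(a) - log_scale(b), log_scale(b) - log_scale(c))  # equal steps
```

On the logarithmic scale the step from A to B equals the step from B to C, which matches the relative perception described above.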

Point 3: If the scale is a function of the size of an object, how much smaller than the 
query can an object be to still be considered large-scale enough? In our tests we 
considered a factor of 256 in each dimension to be appropriate. Thus, an object whose
size is 1/256 of the query window in each dimension is considered just large enough
to be part of the answer. Any other factor could be chosen, but we will base further
data on the value 256.

There is no concept of objects too large for a query window. Even a village-scale
query will return the continent that contains the village.
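
The relevance rule of Point 3 can be written down directly. The sketch assumes that object size and query side length are measured in the same units and that a square query window is used; the function names are hypothetical.

```python
RELEVANCE_FACTOR = 256  # the factor used for the tests in this paper


def min_relevant_size(query_side):
    """Smallest object size that is still returned for this query window."""
    return query_side / RELEVANCE_FACTOR


def is_large_enough(object_size, query_side):
    """There is a lower bound on object size but no upper bound."""
    return object_size >= min_relevant_size(query_side)


print(is_large_enough(4, 1024))      # True: exactly 1/256 of the window side
print(is_large_enough(3.9, 1024))    # False: too small-scale for this query
print(is_large_enough(10**9, 1024))  # True: objects are never too large
```

With b = 2, the same threshold lies eight steps below the query on the logarithmic scale, since log2(256) = 8.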

Point 4: Which is the domain of sizes or scales we would like to distinguish? Objects 
outside the domain of interest would be assigned the maximum / minimum scale 
value. Finally, we must decide on a reasonable scale range within which we would 
like to distinguish objects by scale. Thus, we have to choose k_1 and k_2 in the formula
for S_2. The naive approach would be to index the complete range: suppose the spatial
extents of the original object are indexed by 64-bit values. On a logarithmic scale, this
results in a domain from 0 to 63. Most objects on a map would typically only use the
upper few values. A value of 0 on the logarithmic scale from 0 to 63 would represent
an object with a side length of about 2^-63 of that of the map.
If such small objects are of interest, the query must obviously be restricted to such
a tiny partition of the domain that only very few disk accesses are needed. Thus we are
mainly concerned with distinguishing between larger-scale objects. We are willing to
consider extremely small-scale objects equal to each other (for indexing purposes)
and truncate the lower values in the logarithmic range ([0, 63] in the example). We
decided to index a minimum scale value for objects whose length is a fixed small
fraction of the corresponding dimension's domain. Smaller objects are indexed with
the same scale value.

Should we only truncate on the lower end of the logarithmic scale? Let us consider
two objects A and B of different sizes, each at least 1/256 of the domain in each
dimension. It is clear that S_1(A) ≠ S_1(B). Nevertheless both objects would be
returned by any query in whose range they fall. This is because they are not too
small-scale even for the largest-scale queries (based on our maximum factor 256
between query and returned object). As a result, we can also truncate the upper few
values from the logarithmic scale. The scale we have actually used for our
performance tests is:

$$ S_2(O) = k_1 + k_2 \log_b\bigl(S_1(O)\bigr) $$

with b = 2 and k_2 = 0x10000000 (hexadecimal) = 268,435,456 (decimal); k_1 is a
negative multiple of k_2.

In Table 3 the relation between object size and scale according to this function can be
seen in the left-most two columns.
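
A sketch of a scale function that is truncated at both ends is given below. Only k_2 = 0x10000000, b = 2 and the upper cut-off of 1/256 of the domain come from the text; the lower cut-off `min_fraction`, the 32-bit domain in the example and the choice of k_1 (set so that the smallest class maps to 0) are assumptions made purely for illustration.

```python
import math

K2 = 0x10000000  # k_2 as given in the text (268,435,456)


def truncated_scale(s1, domain, min_fraction, max_fraction=1 / 256):
    """Clamp S_1 to [min_fraction, max_fraction] of the domain, then apply
    k1 + k2 * log2(...). min_fraction is a hypothetical parameter."""
    lo = domain * min_fraction
    hi = domain * max_fraction
    clamped = min(max(s1, lo), hi)
    k1 = -K2 * math.log2(lo)  # assumption: the smallest class maps to scale 0
    return round(k1 + K2 * math.log2(clamped))


domain = 2 ** 32  # hypothetical coordinate domain
print(hex(truncated_scale(2 ** 20, domain, 2 ** -16)))  # 0x40000000
print(truncated_scale(2 ** 25, domain, 2 ** -16) ==
      truncated_scale(2 ** 31, domain, 2 ** -16))       # True: both above the upper cut-off
print(truncated_scale(1, domain, 2 ** -16))             # 0: below the lower cut-off
```

All objects above the upper cut-off share one scale value, as do all objects below the lower cut-off, so the index only distinguishes the levels in between.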



5.2 Scale Based on Type 

Up to now, we have considered the scale of an 
object to depend on its size only. This allows 
for automatic assignment by the software, 
transparent to the user. For most types of 
objects this assignment actually is very similar 
to human perception of which objects should 
appear on a map of a given scale. 

But how does this scale assignment fare 
when we index elevation lines? Elevation lines 
have some special conceptual properties: 

- they are contained one in another and thus 
have a large intersection 

- conceptually, they do not represent different 
objects; rather they represent all geographic points with the same elevation x 

On a large-scale map elevation lines might be recorded in increments of 500 meters or 
more. On a small-scale map, in contrast, this would be too coarse and increments of 
only 1 meter might be used. Obviously, an elevation line at 1000 meters is 
conceptually considered larger-scale than at 1001 meters or 999 meters. We propose 
to assign scales to elevation lines based on the corresponding elevations.

Table 1 informally shows one possible algorithm for the scale assignment for
elevation lines. The exact values appropriate for n and i can easily be adjusted
by comparing with common practice for printed maps. Note that the number of
objects of scale x decreases exponentially with increasing x. In Table 1 the number of
objects with scale n is roughly 10 times the number of objects with scale n+3i.
Further, the objects with scale n+3i are roughly 10 times more than objects with scale 
n+6i. In the section on "Performance Evaluation" we show that it is, in fact, 
advantageous to have more small-scale than large-scale objects. Similar scale 
assignments can be applied to lines showing levels of precipitation, temperature and 
other attributes. 
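
One hypothetical assignment with this property is sketched below; it is not a reproduction of Table 1. It assumes integer elevations in meters and raises the scale by 3i for every factor of 10 that divides the elevation, up to an assumed cap of three decades. The parameters n and i play the role described above.

```python
def elevation_scale(elevation_m, n=0, i=1, max_decades=3):
    """Scale of an elevation line: n + 3*i for each factor of 10 dividing it."""
    level = 0
    value = abs(int(elevation_m))
    while level < max_decades and value != 0 and value % 10 == 0:
        value //= 10
        level += 1
    if value == 0:  # elevation 0 is divisible by every increment
        level = max_decades
    return n + 3 * i * level


print(elevation_scale(1000), elevation_scale(1001), elevation_scale(999))  # 9 0 0

# Count the lines per scale for elevations 1..1000 m in 1 m increments.
counts = {}
for e in range(1, 1001):
    s = elevation_scale(e)
    counts[s] = counts.get(s, 0) + 1
print(counts)  # {0: 900, 3: 90, 6: 9, 9: 1}
```

Under this assumption the line at 1000 meters receives a larger scale than its neighbours at 999 and 1001 meters, and each scale class holds ten times fewer lines than the class 3i below it.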



5.3 Initial Failures 

For the first test results we used a linear scale for the scale dimension and did not 
truncate the lower or higher range. As a result, the value of the scale attribute did not 
sufficiently influence clustering. The performance figures were worse than without a 
scale dimension. Another concern was the type of objects indexed. For an application 
using data of similarly scaled objects, our approach is clearly not appropriate. But 
geographic databases often have a wide range of differently scaled objects. This
is shown in section 6.2, “Relevance of Data and Queries Chosen”.



6 Performance Evaluation 

We first present the data sets we have used and then the performance results on those 
sets. We justify the choice of data distributions in terms of typical geographic 
databases. To see the impact of different parameters we have used synthetic data sets 
instead of an actual geographic database. Furthermore, many actual geographic 
databases contain mainly point data that are more efficient to handle than spatial data. 
Other databases often contain only a rather small number of objects. We want to show 
that our approach can make many applications scalable, even when millions of truly 
spatial objects are in the database. Thus, our performance results include 
measurements on databases with up to 40 million spatial extents - a substantial and 
demanding database size. 



6.1 Type of Data and Queries 

We have run performance tests with databases of 1 million and 40 million objects.
For each database size we used 3 object size distributions. Section 6.2, “Relevance
of Data and Queries Chosen”, shows that distribution #1 is most typical for
geographic databases.

                               Object size              Performance result
Database size        Distr.    distribution             without scale dimension   with scale dimension

1 million extents    #1        Fig. 3 left; Table 3     Fig. 5 left               Fig. 5 right
                     #2        Fig. 3 right; Table 3    Fig. 6 left               Fig. 6 right
                     #3        Fig. 4; Table 3          Fig. 7 left               Fig. 7 right

40 million extents   #1        Table 3                  Fig. 8 left               Fig. 8 right
                     #2        Table 3                  Fig. 9 left               Fig. 9 right
                     #3        Table 3                  Fig. 10 left              Fig. 10 right

Table 2: The databases created for performance tests



The object centers are randomly distributed. The object size distributions (with 1 
million extents) are specified in Fig. 3, Fig. 4 and Table 3.





[Plots not reproduced. x-axis of both panels: size of objects relative to domain in each dimension]

Fig. 3: The object size affects the number of such objects. On the left-hand side the number of
objects of a given size is inversely proportional to the area (object size distribution #1). On the
right-hand side the number of objects of a given size is inversely proportional to the side length
(object size distribution #2).




[Plot not reproduced. x-axis: size of objects relative to domain in each dimension]

Fig. 4: Object size distribution #3: The number of objects of each size is the same (bounded by
some minimum and maximum size). This means the objects in the database are uniformly
distributed according to the scale dimension.

Table 3: Object size distributions with 1 million and 40 million spatial extents.



6.2 Relevance of Data and Queries Chosen 

We have chosen the data and the queries to represent applications with geographic 
databases as faithfully as possible. Which parameter of the data distributions - if 
inappropriately chosen - could make the performance results (partially) irrelevant for 
typical applications? There are several candidates, which we consider subsequently to 
show that our performance results are, in fact, relevant to typical applications: 



- Choice of queries (A) 

- Choice of object locations (centers of represented objects) (B) 

- Choice of object sizes (C) 

- Assignment of object scales (D) 



(A) We have chosen a wide range of queries for each test. The query locations are
randomly distributed, which we consider an appropriate choice. For each query size
between 100% and 1/256 ≈ 0.4% of each dimension (between (100%)^2 = 100% and
(1/256)^2 ≈ 0.0015% of the 2-dimensional domain) we have run 100 random queries (10
random queries when 100 queries took too long†). The performance results in Fig. 5
through Fig. 10 show the minimum, maximum and average values for all queries.
For an application using only a narrow range of query sizes, the relevant
range can easily be extracted from the figures.

(B) The object centers are randomly distributed in our data sets. Considering that
there are only very few “empty” areas on maps, this seems an appropriate distribution.



† The results based on 10 queries of each size are marked by gray backgrounds in the graphs (in
Fig. 8, Fig. 9 and Fig. 10).
Note that the oceans also contain objects such as lines of depth, temperature and
others. As we apply the same object distribution to the tests with and without scale 
dimension the object center distribution should not have any great effect on our 
performance results. 

(C) The object size distribution is comparable to the object center distribution 
considering its impact on our performance results. We consider the broad range of 
object sizes we have used typical for geographic applications. Nevertheless, whatever 
object size distribution is chosen impacts both the results with and without scale 
dimension equally. The performance difference between tests with and without scale 
dimension depends on the scales of objects, not their sizes (note the difference 
between size and scale). 

(D) The object scale distribution finally is the concept with the actual influence on 
how much our approach can help increase performance. If we have chosen non- 
typical distributions for it our performance results are of limited relevance to actual 
applications. We will thus compare our distributions with the distribution found in a 
very large digital map. Note that the object scale distribution is solely chosen to 
influence what should be returned upon a query of a given scale. Our claim is that 
exactly object scale distribution 1 will typically result in query answers similar to 
most existing printed maps. 

Informal proof: Let us assume a geographic database containing all objects that can 
be found on any existing printed map. When we ask a query on such a database we 
would perfectly expect to be returned exactly the same objects that would actually be 
on a printed map of the same area. We thus consider exactly those objects both within 
the specified geographic region and the appropriate scale interval. 

The important fact is that most printed maps we typically work with show a very
similar number of objects. Let us consider only the detail information and not, for
example, the country borders that are only partially visible. This detail information
in a query Q_1 has scale S(Q_1). If a map had many more objects than is typical, it
would be considered too detailed. Consequently, some objects should be assigned a
smaller scale so that they only appear in smaller-scale maps. If a map had far fewer
objects, it would be considered not detailed enough. Thus some objects should be
assigned a larger scale so that they also appear in otherwise “empty” larger-scale
maps. A scale assignment to objects is thus considered reasonable by cartographers if
maps of different scales contain a roughly similar number of objects.

What does this mean for the object scale distribution? We consider a query
window Q_i that covers a proportion a_i of the complete domain. Independent of the
scale we expect a typical number n of objects of scale S(Q_i) as the result of this
query. If a query Q_i covering a proportion a_i of the domain contains n objects of
scale S_i = S(Q_i), then the complete domain can typically be expected to contain
roughly n/a_i such objects. Thus the overall number of objects of scale S(O) = S_i
tends to be inversely proportional to the area covered by a map (query) with scale
S(Q_i) = S_i.
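
The argument can be made concrete with a few numbers. The sketch assumes a hypothetical typical count of n = 100 objects per map and square query windows whose side is a power-of-two fraction of the domain; the function name is illustrative.

```python
def expected_objects_in_domain(n_per_query, area_fraction):
    """Number of objects of the query's scale expected in the whole domain."""
    return n_per_query / area_fraction


n = 100  # hypothetical typical number of objects shown on one map
for j in (0, 1, 2, 8):
    side = 2.0 ** -j    # query side relative to the domain
    area = side * side  # proportion of the domain covered by the query
    print(side, expected_objects_in_domain(n, area))
```

Halving the side of the query window quarters its area and therefore quadruples the expected number of objects at the corresponding scale.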

This distribution is exactly object scale distribution #1 presented in Fig. 3 (left
side) and Table 3. In fact, this distribution shows the best performance results among
the three distributions we have used. We have added the other two distributions to 
show how our method performs with adverse object scale distributions. 

If a geographic database has an object scale distribution like distribution #3 or even 
“worse” then it contains only little small-scale detail. Serious applications with large 
geographic databases arguably can be expected to contain considerable small-scale
detail. 



6.3 Performance Measurements 

The following figures present our performance measurements (I/O) for the different 
data sets. We have run 100 random queries of each query size (figures with white 
background). With those tests involving large numbers of disk accesses per query we 
only performed 10 random queries of each size (figures with gray background). The 
query sizes can be seen on the x-coordinate of the figures. 

Thus we have varied 

- The object size distribution (each row with a pair of figures relates to one object 
size distribution) 

- The database size (1 million spatial extents in the first 3 rows, 40 million in the last 
3 rows) 

- The query size (varied on the x-axis in each figure) 

- The use of a scale dimension (without a scale dimension on the left-hand side and 
with a scale dimension on the right-hand side) 

The vertical bars in each figure show the minimum and maximum values for single 
queries (among 10 or 100 random queries each). The curve crosses each vertical bar 
at the average number of disk accesses per random query. 



[Plots not reproduced. Left panel: without scale dimension; right panel: with scale dimension. Both panels: 1 million extents; average, minimum and maximum of 100 random queries. x-axis: query size relative to domain in each dimension; y-axis: page accesses]

Fig. 5: Performance measurements for object size distribution 1 (1 million spatial extents).
According to the informal proof in section 6.2, this is the typical distribution for large
geographic databases.

[Plots not reproduced. Left panel: without scale dimension; right panel: with scale dimension. Both panels: 1 million extents; average, minimum and maximum of 100 random queries. x-axis: query size relative to domain in each dimension]

Fig. 6: Performance measurements for object size distribution 2 (1 million spatial extents)



[Plots not reproduced. Left panel: without scale dimension; right panel: with scale dimension. Both panels: 1 million extents; average, minimum and maximum of 100 random queries. x-axis: query size relative to domain in each dimension]

Fig. 7: Performance measurements for object size distribution 3 (1 million spatial extents)



[Plots not reproduced. Left panel: without scale dimension, 40 million extents; average, minimum and maximum of 10 random queries. Right panel: with scale dimension, 40 million extents; average, minimum and maximum of 100 random queries]

Fig. 8: Performance measurements for object size distribution 1 (40 million spatial extents).
According to the informal proof in section 6.2, this is the typical distribution for large
geographic databases.

[Plots not reproduced. Left panel: without scale dimension, 40 million extents; average, minimum and maximum of 10 random queries. Right panel: with scale dimension, 40 million extents; average, minimum and maximum of 100 random queries. x-axis: query size relative to domain in each dimension]

Fig. 9: Performance measurements for object size distribution 2 (40 million spatial extents)



[Plots not reproduced. Left panel: without scale dimension; right panel: with scale dimension. Both panels: 40 million extents; average, minimum and maximum of 10 random queries. x-axis: query size relative to domain in each dimension]

Fig. 10: Performance measurements for object size distribution 3 (40 million spatial extents).
Note that the query on the complete domain (returning 40 million objects) could not be
measured in this case, because main memory was exhausted.



6.4 Analysis 

In Fig. 5 through Fig. 10 it becomes clear that the scale dimension can considerably
improve performance. Especially in the typical‡ case that there are far more
small-scale than large-scale objects (object size distribution 1, Fig. 5, Fig. 8) the
improvement is substantial. CPU time is negligible with and without scale dimension.

With increasing database size (from 1 million to 40 million extents) the
improvement factor due to the scale dimension even increases. Our technique thus
scales very well with the size of the database.



‡ As shown in section 6.2, “Relevance of Data and Queries Chosen”, this is typical for a large
geographic database.



6.5 Queries without Scale

The question arises how queries perform without restriction of the scale. We thus
index with scale but make the conscious decision to query without scale. The query
then returns all objects geographically in the query window.

We thus have three approaches: 





                                Indexing
                                Without scale (I)    With scale (I')

Querying   Without scale (Q)    IQ                   I'Q
           With scale (Q')      -                    I'Q'

Table 4: Possible combinations of indexing and querying with and without scale



It can be assumed that approach I'Q results in slightly lower performance than the
initial IQ, for two reasons:

- I' results in a 25% larger index. This is because of the additional space
requirements for the (point) scale dimension (1 integer). The original two (spatial)
dimensions need two integers each, so an entry grows from four integers to five.

- I' clusters by three dimensions instead of two. As Q fundamentally only uses 2/3 of
the clustering potential, it can be expected to perform worse: queries have to access
data that are more broadly spread over disk pages.



[Plots not reproduced. Left panel: IQ; right panel: I'Q. Both panels: 40 million extents; average, minimum and maximum of 10 random queries. x-axis: query size relative to domain in each dimension]

Fig. 11: Performance comparison between IQ and I'Q

[Plot not reproduced. Title: Performance penalty of I'Q vs. IQ; x-axis: query size]

Fig. 12: Relative performance penalty of I'Q compared to IQ

Fig. 12 shows a performance comparison between IQ and I'Q. The performance
penalty for using I'Q is between 40% and 45% for the larger queries. In the case of the
smaller queries, the penalty increases to nearly 80%. Note that the difference is only a
few disk accesses in absolute terms: the small queries are geographically very
restricted.



7 Conclusion 

We have introduced the application of an additional scale dimension to spatial data. 
Although the additional dimension increases the size of the index slightly, it can 
considerably speed up spatial queries. Specifically, it results in better scalability with 
respect to the query size: especially very large-scale queries that generally need many 
disk accesses benefit from our approach. Extremely small-scale queries that generally 
only need very few disk accesses might be slightly inhibited. 

We have presented tuning considerations that can have a considerable effect on 
performance, and have pointed out applications that typically may not benefit from 
our approach. Our approach is not suited for spatial data sets with only minimal 
variations in object scales or more large-scale than small-scale objects. 

With a scale dimension, the index structure itself is given the ability to consider 
only spatial objects that are sufficiently large-scale with respect to the query scale. 
This not only increases indexing performance considerably, but also makes a filtering 
step superfluous. It is an integrated approach to both 

- keep the query answer restricted to relevant objects 

- considerably improve performance 



Abstract. There is an increasing need to integrate spatial index structures into
commercial database management systems. In geographic information systems
(GIS), huge amounts of information involving both spatial and thematic
attributes have to be managed. Whereas relational databases are adequate for
handling thematic attributes, they fail to manage spatial information efficiently. In this
paper, we point out that neither a hybrid solution using relational databases and a 
separate spatial index nor the approach of existing object-relational database sys- 
tems provide a satisfying solution to this problem. Therefore, it is necessary to 
map the spatial information into the relational model. Promising approaches to 
this mapping are based on space-filling curves such as Z-ordering or the Hilbert 
curve. These approaches perform an embedding of the multidimensional space 
into the one-dimensional space. Unfortunately, the techniques are very sensitive to 
the suitable choice of an underlying resolution parameter if objects with a spatial 
extension such as rectangles or polygons are stored. The performance usually 
deteriorates drastically if the resolution is chosen too high or too low. Therefore, 
we present a new kind of ordering which allows an arbitrary high resolution with- 
out performance degeneration. This robustness is achieved by avoiding object du- 
plication, allowing overlapping Z-elements, by a novel coding scheme for the Z- 
elements and an optimized algorithm for query processing. The superiority of our 
technique is shown both theoretically and practically, with a comprehensive
experimental evaluation.

1. Motivation 

Index structures for spatial database systems have been extensively investigated during 
the last decade. A great variety of index structures and query processing techniques has
been proposed [Güt 94, GG 98]. Most techniques are based on hierarchical tree structures
such as the R-tree [Gut 84] and its variants [BKSS 90, SRF 87, BKK 97]. In these
approaches, each node corresponds to a page of the background storage and to a region 
of the data space. 

There is an increasing interest in integrating spatial data into commercial database
management systems. Geographic information systems (GIS) are data-intensive
applications involving both spatial and thematic attributes. Thematic attributes are
usually best represented in the relational model, where powerful and adequate tools for
evaluation and management are available. Relational databases, however, fail to manage
spatial attributes efficiently. Therefore, it is common to store thematic attributes in a
relational database system and spatial attributes outside the database in file-based
multidimensional index structures (hybrid solution).

The hybrid solution has various disadvantages. In particular, the integrity of data
stored in two ways, inside and outside the database system, is difficult to maintain. If an
update operation involving both spatial and thematic attributes fails in the relational
database (e.g. due to concurrency conflicts), the corresponding update in the spatial
index must be undone to guarantee consistency. Vice versa, if the spatial update fails, the
corresponding update to the relational database must be aborted. For this purpose, a
distributed commit protocol for heterogeneous database systems must be implemented,
a time-consuming task which requires a deep knowledge of the participating systems.
The hybrid solution involves further problems. File systems and database systems
usually have different approaches to data security, backup and concurrent access. File-based
storage does not guarantee physical and logical data independence. Thus, changes in
running applications are complicated.

A promising approach to overcome these disadvantages is based on object-relational 
database systems. Object-relational database systems are relational database systems 
which can be extended by application-specific data types (called data cartridges or data 
blades). The general idea is to define data cartridges for spatial attributes and to manage 
spatial attributes in the database. For data-intensive GIS applications it is necessary to 
implement the multidimensional index structures in the database. This requires
access to the block manager of the database system, which is not granted by most
commercial database systems. For instance, the current universal servers by ORACLE and
INFORMIX do not provide any documentation of a block-oriented interface to the
database. Data blades/cartridges are only allowed to access relations via the SQL
interface. Thus, current object-relational database systems are not very helpful for our
integration problem.

In summary, with current object-relational database systems as well as with pure
relational database systems, the only possible way to store spatial attributes inside
the database is to map them into the relational model. An early solution for the
management of multidimensional data in relations is based on space-filling curves.
Space-filling curves map points of a multidimensional space to one-dimensional values.
The mapping is distance preserving in the sense that points which are close to each
other in the multidimensional space are likely to be close to each other in the
one-dimensional space. Although distance preservation is not strict in this concept,
the search for matching objects is usually restricted to a limited area in the
embedding space.

The concept of space-filling curves has been extended to handle polygons. This idea
is based on the decomposition of the polygons according to the space-filling curve. We
will discuss this approach in section 2 and reveal its major disadvantage: it is very
sensitive to the choice of the resolution parameter. We will present a new method
for applying space-filling curves to spatially extended objects which is not based on
decomposition and avoids the associated problems.

For concreteness, we concentrate in this paper on the implementation of the first
filter step for queries with a given query region such as window queries or range queries.
Further filter steps and the refinement step are beyond the scope of this paper. The rest
of this paper is organized as follows: In section 2, we introduce space-filling curves and
review the related work. Section 3 explains the general idea and gives an overview of our
solution. Section 4 shows how operations such as insert, delete and search are handled.
In section 5, a comprehensive experimental evaluation of our technique using the
relational database management system ORACLE 8 is performed, showing the superiority
of our approach over standard query processing techniques and competing approaches.



2. Z-Ordering 

Z-Ordering is based on a recursive decomposition of the data space as it is provided by
a space-filling curve [Sag 94, Sam 89] called Z-ordering [OM 84], Peano/Morton Curve
[Mor 66], Quad Codes [FB 74] or Locational Codes [AS 83].



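Figure 1: Z-Ordering. (Quadrants 0 and 1 in the lower row, 2 and 3 in the upper row;
quadrant 2 is subdivided into 20, 21, 22 and 23, and 21 is further subdivided into 210 to 213.)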



2.1 Z-Ordering in Point Databases 

Assume a point taken from the two-dimensional unit square $[0..1]^2$. The algorithm
partitions the unit square into 4 quadrants of equal size (we change the description of
Z-ordering here slightly to make it more comparable to our approach), which are
canonically numbered from 0 to 3 (cf. figure 1). We note the number of the quadrant and
partition this quadrant into its four sub-quadrants. This is recursively repeated until a
certain basic resolution is reached. The fixed number of recursive iterations is called
the resolution level g. Then we stop and use the obtained sequence of g digits (called
quadrant sequence) as ordering key for the points (we order lexicographically). Each
quadrant sequence represents a region of the data space called element. For instance,
the sequence <00> stands for an element with side length 0.25 touching the lower left
corner of the data space. Elements at the basic resolution which are represented by
quadrant sequences of length g are called cells. If an element $e_1$ is contained in
another element $e_2$, then the corresponding quadrant sequence $Q(e_2)$ is a prefix of
$Q(e_1)$. The longer a quadrant sequence is, the smaller is the corresponding element.
In the unit square, the area of an element represented by a sequence of length l is
$(1/4)^l$. In a point database, only cells at the basic resolution are used. Therefore,
all quadrant sequences have the same length and we can interpret the quadrant sequences
as numbers represented in the quaternary system (i.e. base 4). Interpreting sequences as
numbers facilitates their management in the index and does not change the ordering of
the points, because the lexicographical order corresponds to the less-equal relation of
numbers. The points are managed in an order-preserving one-dimensional index structure
such as a B+-tree.
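
The recursive numbering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `quadrant_sequence` and `z_value` are hypothetical, and the quadrant numbering (0 = lower left, 1 = lower right, 2 = upper left, 3 = upper right) is inferred from the <00> example and from figure 1.

```python
def quadrant_sequence(x, y, g):
    """Return the g quadrant digits (0..3) of a point in the unit square.

    Assumed numbering: 0 = lower left, 1 = lower right,
    2 = upper left, 3 = upper right.
    """
    seq = []
    x0, y0, size = 0.0, 0.0, 1.0          # lower left corner and side length
    for _ in range(g):
        size /= 2.0                        # side length of the sub-quadrants
        right = x >= x0 + size
        upper = y >= y0 + size
        seq.append((2 if upper else 0) + (1 if right else 0))
        if right:
            x0 += size
        if upper:
            y0 += size
    return seq


def z_value(seq):
    """Interpret a fixed-length quadrant sequence as a base-4 number."""
    value = 0
    for q in seq:
        value = value * 4 + q
    return value


points = [(0.3, 0.6), (0.9, 0.9), (0.1, 0.1)]
keys = [z_value(quadrant_sequence(x, y, 2)) for x, y in points]
print(keys)
```

Because all sequences have the same length g, sorting by the base-4 number gives the same order as sorting the sequences lexicographically.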



2.2 Query Processing in Z-Ordering 

Assume a window query with a specified window. The data space is decomposed into its 
four quadrants. Each quadrant is tested for intersection with the query window. If the 
quadrant does not intersect the query window, nothing has to be done. If the quadrant is 
completely enclosed in the query window, we have to retrieve all points from the data- 
base having the quadrant sequence of this element as a prefix of their keys. If the keys 
are represented as integer numbers (cf. section 3.2), we have to retrieve an interval of 
subsequent numbers. All remaining quadrants which are intersected by the window but 
not completely enclosed in the window (i.e. “real” intersections) are decomposed recur- 
sively until the basic resolution is reached. 
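
A minimal sketch of this decomposition is given below. It assumes the quadrant numbering 0 = lower left to 3 = upper right, treats a quadrant that merely touches the window boundary as not intersecting, and reports a partially covered cell at the basic resolution as a candidate; the function name `z_intervals` and the tuple layout of the window are hypothetical.

```python
def z_intervals(window, g):
    """Translate a window (xmin, ymin, xmax, ymax) into Z-value intervals.

    Each result (lo, hi) is an inclusive range of base-4 keys of length g.
    """
    xmin, ymin, xmax, ymax = window
    out = []

    def visit(x0, y0, size, prefix, depth):
        x1, y1 = x0 + size, y0 + size
        # Quadrant and window do not overlap: nothing has to be done.
        if x1 <= xmin or x0 >= xmax or y1 <= ymin or y0 >= ymax:
            return
        inside = xmin <= x0 and x1 <= xmax and ymin <= y0 and y1 <= ymax
        if inside or depth == g:
            # All keys having this quadrant sequence as prefix form one interval.
            span = 4 ** (g - depth)
            lo = prefix * span
            out.append((lo, lo + span - 1))
            return
        half = size / 2.0
        for q in range(4):                 # "real" intersection: decompose further
            visit(x0 + (q & 1) * half, y0 + (q >> 1) * half,
                  half, prefix * 4 + q, depth + 1)

    visit(0.0, 0.0, 1.0, 0, 0)
    return out


print(z_intervals((0.25, 0.25, 0.75, 0.75), 2))
```

Each resulting pair can be translated into a range predicate on the key column; adjacent intervals are not merged in this sketch.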
Figure 2: The one-value-representation.

2.3 A Naive Approach for Polygon Databases 

To extend the concept of Z-ordering for the management of objects with a spatial exten- 
sion (e.g. rectangles or polygons), we face the problem that a given polygon intersects 
with many cells. A naive approach could be to store every cell covered by the object in 
the database. Obviously, this method causes a huge storage overhead unless the basic 
grid is very coarse. Therefore, several methods have been proposed to reduce the over- 
head when using a finer grid. 

2.4 One-Value-Approximation

The objects are approximated by the smallest element which encloses the complete 
object (cf. figure 2). In this case our recursive algorithm for the determination of the 
quadrant sequence is modified as follows: Partition the current data space into four 
quadrants. If exactly one quadrant is intersected by the object, proceed recursively with 
this quadrant. If more than one quadrant is intersected, then stop. Use the quadrant 
sequence obtained up to that point as the ordering key. This method has the obvious
advantage that each object is represented by a single key, not by a set of keys as in our
naive approach. However, this method also has several disadvantages. The first
disadvantage is that the quadrant sequences in this approach have different lengths,
depending on the resolution of the smallest enclosing quadrant. Thus, our simple
interpretation as a numerical value is not possible. Keys must be stored as strings of
variable length and compared lexicographically, which is less efficient than numerical
comparisons. The second problem is that objects may be represented very poorly. For
instance, any polygon intersecting one of the axis-parallel lines in the middle of the
data space (the line x = 0.5 and the line y = 0.5) can only be approximated by the empty
quadrant sequence. If the polygon to be approximated is very large, an approximation by
the empty sequence or by very short sequences seems to be justified. For small polygons,
the relative approximation error is too large. The relative space overhead of the object
approximation is thus unlimited. In fact, objects approximated by the empty quadrant
sequence are candidates for every query a user asks. The more objects with short
quadrant sequences are stored in the database, the worse is the selectivity of the index.
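
The following sketch illustrates the modified descent for an axis-parallel bounding box. It is an assumption-laden illustration, not the original implementation: the name `one_value_sequence` is hypothetical, the object is represented by its bounding box, and the quadrant numbering 0 = lower left to 3 = upper right is assumed.

```python
def one_value_sequence(xmin, ymin, xmax, ymax, g):
    """Quadrant sequence of the smallest element enclosing the box."""
    seq = []
    x0, y0, size = 0.0, 0.0, 1.0
    for _ in range(g):
        half = size / 2.0
        mid_x, mid_y = x0 + half, y0 + half
        # The box crosses a middle line: more than one quadrant is
        # intersected, so the descent stops here.
        if xmin < mid_x < xmax or ymin < mid_y < ymax:
            break
        right, upper = xmin >= mid_x, ymin >= mid_y
        seq.append(2 * upper + right)
        x0, y0, size = x0 + half * right, y0 + half * upper, half
    return seq


# A tiny object lying on the line x = 0.5 gets the empty sequence.
print(one_value_sequence(0.49, 0.10, 0.51, 0.12, 8))
# An object of similar size elsewhere gets a longer sequence.
print(one_value_sequence(0.10, 0.10, 0.12, 0.12, 8))
```

The first call shows the poor approximation discussed above: the key length depends on where the object lies, not only on how large it is.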

2.5 Optimized Redundancy 

To avoid the unlimited approximation overhead, Orenstein proposes a combination of 
the naive approach and the one-sequence representation [Ore 89a, Ore 89b]. He adopts 
the idea of the object decomposition in the naive approach, but does not necessarily 
decompose the object until the basic resolution is reached. Instead, he proposes two 
different criteria, called size-bound and error-bound to control the number of quadrants 
into which an object is decomposed. Each subobject is stored in the index by using its 
quadrant sequence, e.g. represented as a string. Although this concept involves object
duplication (which is called redundancy by Orenstein), the number of records stored in
the index is not directly determined by the grid resolution as in the naive approach. 
Unlike in the one-sequence approach, it is not necessary to represent small objects by the 
empty sequence or by very short sequences. According to Orenstein, typically a decom- 
position into 2-4 parts is sufficient for a satisfactory search performance. 

Orenstein’s approach alleviates the problems of the two previous approaches, but a 
duplicate elimination is still required and the keys are sequences with varying length. 
Orenstein determines an optimal degree of redundancy only experimentally. An analyt- 
ical solution was proposed by Gaede [Gae 95] who identified the complexity of the 
stored polygons, described by their perimeter and their fractal dimension, as the main 
parameters for optimization. A further problem when redundancy is allowed arises in 
connection with secondary filters in a multi-step environment. Information which can be 
exploited for fast filtering of false hits, such as additional conservative approximations 
(e.g. minimum bounding rectangles MBR), should not be subject to duplication due to its 
high storage requirement. To avoid duplication of such information, it must be stored in 
a separate table which implies additional joins in query processing. 

A further consequence of Gaede’s analysis [Gae 95] is that the number of intervals
which are generated from the query window is proportional to the number of grid cells
intersected by the boundary of the query window (i.e. its perimeter). This means that
too fine a resolution of the grid leads to a large number of intervals and thus to
deteriorated performance when a relational database system is used. The reason is that
the intervals must be transferred to and processed by the database server, and this cost
is not negligible if the number of intervals is very high (e.g. in the thousands).

2.6 Alternative Techniques 

Several improvements of the Z-ordering concept are well known (cf. figure 3). Some
authors propose the use of different curves such as Gray Codes [Fal 86, Fal 88], the
Hilbert Curve [FR 89, Jag 90] or other variations [Kum 94]. Many studies [Oos 90,
Jag 90, FR 91] prefer the Hilbert curve among the proposals because it has the best
distance preservation properties (also called spatial clustering properties). [Klu 98]
proposes a great variety of space-filling curves and makes a comprehensive performance
study using a relational implementation. As this performance evaluation [Klu 98] does
not yield a substantial performance improvement of the Hilbert curve or other
space-filling curves over Z-ordering, we use the Peano/Morton curve because it is
easier to compute.




Figure 3: Various Space-Filling Curves.

3. A Space-Filling Curve for Spatially Extended Objects

In contrast to the previous approaches, we propose a solution which avoids the
disadvantages of object duplication and variable-length quadrant sequences. Our method
is thus completely insensitive to a grid resolution that is too fine. There is no need
to optimize the resolution parameter. It can always be taken as fine as possible, i.e.
the full bit precision of the CPU can be exploited. Three ideas are applied to achieve
this robustness: The first idea, presented in section 3.1, is to incorporate overlap
into the concept of elements. We will define the elements such that adjacent elements
at the same resolution level l overlap each other by up to 50%. This method enables us
to store objects without redundancy (i.e. object duplication) and without uncontrolled
approximation error. In particular, it is impossible that a very small object must be
represented by a very short sequence or even by the empty sequence.

The second idea is to use a sophisticated coding scheme for the quadrant sequences 
which maps quadrant sequences with varying length into the integer domain in a dis- 
tance-preserving way. The coding algorithm is presented in section 3.2. The third idea 
(cf. section 4.3) is an efficient algorithm for interval generation in query processing. The 
goal of the algorithm is to close small gaps between adjacent intervals if the overhead of 
processing an additional interval is larger than the cost for overreading the interval gap. 

We call our technique which maps polygons into integer-values extended Z-ordering 
or XZ-ordering. The integer values forming the keys for search are called XZ-values. For 
each polygon in the database, we store one record which contains its XZ-value and a 
pointer to the exact geometry representation of the polygon. As we avoid object dupli- 
cation, further information such as thematic attributes or information for secondary fil- 
ters (the MBR or other conservative and progressive approximations, cf. [BKS 93]) can 
be stored in the same table. 

3.1 Overlapping Cells and Elements 

The most important problem of the one-value representation is that several objects are
approximated very poorly. Every object intersecting one of the axis-parallel lines
x = 0.5 and y = 0.5 is represented by the empty quadrant sequence, which characterizes
the element comprising the complete data space. If the object extension is very small
(close to 0), the relative approximation error diverges to infinity.

In fact, every technique which decomposes the space into disjoint cells gets into
trouble if an object is located on the boundary between large elements. Therefore, we
modify our definition of elements such that overlap among elements on the same
resolution level l is allowed.

The easiest way to envisage a definition of overlapping elements is to take the original
elements as obtained by Z-ordering and to enlarge the height and width by a factor of 2
upwards and to the right, as depicted in figure 4. Then, two adjacent cells overlap each
other by 50%. The special advantage is that this definition also contains small elements
for objects intersecting the middle axes.

Definition 1: Enlarged elements

The lower left corner of an enlarged element corresponds to the lower left corner of
the corresponding element in Z-ordering. Let s be the quadrant sequence of this lower
left corner and let $|s|$ denote its length. The upper right corner is translated such
that the height and width of the element is $2 \cdot 0.5^{|s|}$.

It is even possible to guarantee bounds for a minimal length of the quadrant sequence 
(and thus of the approximation quality) based on the extension of the object in x- and y- 
direction. 

Lemma 1. Minimum and Maximum Length of the Quadrant Sequence

The length $|s|$ of the quadrant sequence s for an object with height h and width w is
bounded by the following limits:

$$l_1 \le |s| < l_2 \quad \text{with} \quad l_1 = \lfloor \log_{0.5}(\max\{w, h\}) \rfloor \quad \text{and} \quad l_2 = l_1 + 2$$

Proof (Lemma 1)

Without loss of generality, we assume $w \ge h$. We consider the two disjoint space
decompositions into elements at the resolution levels $l_1$ and $l_2$ and call the arising
decomposition grids the $l_1$-grid and the $l_2$-grid, respectively. The distances $w_1$
and $w_2$ between the grid lines are equal to the widths of the elements at the
corresponding decomposition levels $l_1$ and $l_2$.

(1) The distance $w_1$ between two lines of the $l_1$-grid is greater than or equal to w,
because

$$w_1 = 0.5^{l_1} = 0.5^{\lfloor \log_{0.5}(w) \rfloor} \ge 0.5^{\log_{0.5}(w)} = w.$$

Therefore, the object can at most be intersected by one grid line parallel to the y-axis
and by one grid line parallel to the x-axis. If the lower left element among the
intersecting elements is enlarged as in definition 1, the object must be completely
contained in the enlargement.

(2) The distance $w_2$ between two lines of the $l_2$-grid is smaller than $w/2$,
because

$$w_2 = 0.5^{l_2} = 0.5^{\lfloor \log_{0.5}(w) \rfloor + 2} < 0.5^{\log_{0.5}(w) + 1} = w/2.$$

Therefore, the object is intersected by at least two lines of the $l_2$-grid parallel to
the y-axis, and there is no element at the $l_2$-level which can be enlarged according to
definition 1 such that the object is contained.

□ 
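
The bounds of Lemma 1 can be checked numerically. The sketch below is an illustration only: the helper names `floor_log_half` and `longest_enclosing_level` are hypothetical, the object is given by its lower left corner and its extension, and the logarithm is computed by an integer loop to avoid floating-point rounding at exact powers of two.

```python
import math


def floor_log_half(m):
    """Largest integer l with 0.5**l >= m, i.e. floor(log_0.5(m)) for 0 < m <= 1."""
    l = 0
    while 0.5 ** (l + 1) >= m:
        l += 1
    return l


def longest_enclosing_level(xmin, ymin, w, h, g):
    """Longest sequence length whose enlarged element contains the object.

    At each level the element holding the lower left corner of the object
    is enlarged by a factor of 2 upwards and to the right (definition 1).
    """
    for l in range(g, -1, -1):
        cell = 0.5 ** l
        ex = math.floor(xmin / cell) * cell    # lower left corner of the element
        ey = math.floor(ymin / cell) * cell
        if xmin + w <= ex + 2 * cell and ymin + h <= ey + 2 * cell:
            return l
    return 0


w, h = 0.02, 0.01
l1 = floor_log_half(max(w, h))
length = longest_enclosing_level(0.49, 0.49, w, h, 16)
print(l1, length, l1 <= length < l1 + 2)
```

The example object lies on both middle lines of the data space, yet its sequence length stays within the bounds of the lemma.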

Lemma 1 can be exploited to provide bounds for the relative approximation error of
objects. As polygons can be arbitrarily complex, it is not possible for any approximation
technique with restricted complexity (such as MBRs or our technique) to provide error
bounds for general polygons. However, we can guarantee a maximum relative error for
square objects which are not smaller than the basic resolution:





Figure 4: Enlarged Regions in XZ-Ordering.

Lemma 2. Maximum approximation error

The relative approximation error for square objects is limited.

Proof (Lemma 2)

According to lemma 1, the quadrant sequence for a square object of width w has at
least the length $l_1 = \lfloor \log_{0.5}(w) \rfloor$. The width of the corresponding cell is

$$w_1 = 0.5^{l_1} = 0.5^{\lfloor \log_{0.5}(w) \rfloor} < 0.5^{\log_{0.5}(w) - 1} = 2 \cdot w.$$

The enlarged element has the area $A = 4 w_1^2 < 16 w^2$.

The maximum approximation error is limited by the following value:

$$\frac{A - w^2}{w^2} < \frac{16 w^2 - w^2}{w^2} = 15$$

□ 

Although a relative approximation error of 15 seems rather large, it is an important 
advantage of our technique over the one-value representation that the approximation 
error is limited at all. Our experimental evaluation will show that also the average ap- 
proximation error of our technique is smaller than that of the one-value representation. 
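
As a worked illustration of the bound, consider a square of width $w = 0.3$ (a value chosen here only as an example), approximated at the minimum length guaranteed by lemma 1:

```latex
\begin{align*}
l_1 &= \lfloor \log_{0.5}(0.3) \rfloor = \lfloor 1.74 \rfloor = 1, \\
w_1 &= 0.5^{l_1} = 0.5 < 2w = 0.6, \\
A   &= 4 w_1^2 = 1 < 16 w^2 = 1.44, \\
\frac{A - w^2}{w^2} &= \frac{1 - 0.09}{0.09} \approx 10.1 < 15 .
\end{align*}
```

If the object additionally fits into an enlarged element of length $l_1 + 1$, the error is smaller still.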

3.2 Numbering Quadrant Sequences 

We are given a quadrant sequence with a length varying from 0 to the maximum length
g determined by the basic resolution. Our problem is to assign numbers to the sequences
in an order-preserving way, i.e. the less-equal order of the numbers must correspond to
the lexicographical order of the quadrant sequences. Let the length of the quadrant
sequence s be l. The following lemma is used to determine the number of cells and
elements contained in the corresponding region R(s):

Lemma 3. Number of cells and elements inside of region R(s)

The number of cells of resolution g contained in a region described by the quadrant
sequence s with $|s| = l$ is

$$N_{cell}(l) = 4^{g-l}.$$

The corresponding number of elements (including element R(s) and the cells) is

$$N_{elem}(l) = \frac{4^{g-l+1} - 1}{3}.$$



Proof (Lemma 3)

(1) There is a total of $4^l$ elements with length l and a total of $4^g$ cells in the
data space. As both cells and elements of length l cover the data space in a complete and
overlap-free way, the area of an element is $(1/4)^l / (1/4)^g = 4^{g-l}$ times larger
than the area of a cell.

(2) The number of elements of length i contained in an element of length l corresponds
to $4^{i-l}$. For obtaining the number of all elements, we have to sum over all i ranging
from l to g:

$$N_{elem}(l) = \sum_{l \le i \le g} 4^{i-l} = \frac{4^{g-l+1} - 1}{3}$$



□ 
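
The two counts of Lemma 3 can be confirmed by brute-force enumeration for small g. The sketch below is illustrative; the function names are hypothetical.

```python
from itertools import product


def n_cell(l, g):
    """Number of cells (length g) below an element of length l."""
    return 4 ** (g - l)


def n_elem(l, g):
    """Number of elements of length l..g below an element of length l."""
    return (4 ** (g - l + 1) - 1) // 3


def count_by_enumeration(l, g):
    """Enumerate all sequence extensions of a fixed prefix of length l."""
    cells = elems = 0
    for length in range(l, g + 1):
        for _tail in product(range(4), repeat=length - l):
            elems += 1                 # every extension is an element
            if length == g:
                cells += 1             # extensions of full length are cells
    return cells, elems


print(count_by_enumeration(2, 4), (n_cell(2, 4), n_elem(2, 4)))
```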



For the definition of our numbering scheme, we have to make sure that between the
codes of two subsequent strings $s_1$ and $s_2$ of length l there are enough numbers for
all strings which are ordered between $s_1$ and $s_2$. These are exactly the strings
having $s_1$ as prefix, and their number is thus $(4^{g-l+1} - 1)/3$. We therefore
multiply each quadrant number $q_i$ in the sequence
$s = \langle q_0 \ldots q_i \ldots q_{l-1} \rangle$ with $(4^{g-i} - 1)/3$.



Definition 2: Sequence Code

The sequence code C(s) of a quadrant sequence
$s = \langle q_0\, q_1 \ldots q_i \ldots q_{l-1} \rangle$ corresponds to

$$C(s) = \sum_{0 \le i < l} \left( q_i \cdot \frac{4^{g-i} - 1}{3} + 1 \right)$$
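
A short sketch of the sequence code, assuming the formula $C(s) = \sum_{0 \le i < l} (q_i \cdot (4^{g-i} - 1)/3 + 1)$; the function name `sequence_code` is hypothetical. For a small resolution, all sequences can be enumerated to observe that the codes follow the lexicographical order and fill a contiguous range.

```python
from itertools import product


def sequence_code(seq, g):
    """Map a quadrant sequence of length 0..g to an integer.

    Each digit q at position i skips q complete subtrees of
    (4**(g - i) - 1) // 3 elements, plus 1 for the prefix element itself.
    """
    return sum(q * (4 ** (g - i) - 1) // 3 + 1 for i, q in enumerate(seq))


g = 3
all_seqs = [s for l in range(g + 1) for s in product(range(4), repeat=l)]
all_seqs.sort()                        # tuple comparison is lexicographical
codes = [sequence_code(s, g) for s in all_seqs]
print(codes == list(range(len(all_seqs))))
```

Python compares tuples lexicographically and places a prefix before its extensions, which matches the order used for quadrant sequences here.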



Lemma 4. Ordering Preservation of the Sequence Code

The less-equal order of sequence codes corresponds to the lexicographical order of
the quadrant sequences:

$$s_1 <_{lex} s_2 \iff C(s_1) < C(s_2)$$

Proof (Lemma 4)

‘=>’: Suppose $s_1 <_{lex} s_2$ with $s_1 = \langle q_0 \ldots q_{l-1} \rangle$ and
$s_2 = \langle p_0 \ldots p_{m-1} \rangle$. Then, one of the following predicates must be
true according to the definition of the lexicographical order:

(1) there exists an i ($0 \le i \le \min\{m, l\} - 1$) such that $q_i < p_i$ and
$q_j = p_j$ for all $j < i$

(2) $m > l$ and $q_j = p_j$ for all $j < l$.

In case (1), we know that the i-th term of the sum of $C(s_1)$ is less than the i-th term
in the sum of $C(s_2)$ by at least $(4^{g-i} - 1)/3$. The summands for all $j < i$ are
equal. We have to show that the difference D of the sums of all remaining terms is smaller
than $N_{elem}(i + 1) = (4^{g-i} - 1)/3$ to guarantee $C(s_1) < C(s_2)$. The difference D
is maximal if $q_j = 3$ and $p_j = 0$ for all $j > i$. In this case, we can determine D
as follows:

$$D = \sum_{i < j < g} 3 \cdot \frac{4^{g-j} - 1}{3} = \sum_{0 < j < g-i} \left( 4^j - 1 \right) = \frac{4^{g-i} - 1}{3} - (g - i) < \frac{4^{g-i} - 1}{3}$$

In case (2), $C(s_2) > C(s_1)$, because the first l summands are equal, and $C(s_2)$ has
additional positive summands which are not available in the sum of $C(s_1)$.

‘<=’: We can rewrite the condition $s_1 <_{lex} s_2 \Leftarrow C(s_1) < C(s_2)$ as
$s_1 \ge_{lex} s_2 \Rightarrow C(s_1) \ge C(s_2)$. The proof is then analogous to the
‘=>’-direction.

Lemma 5. Minimality of the Sequence Code

There exists no mapping from the set of quadrant sequences to the set of natural
numbers which requires a smaller interval of the natural numbers than C(s).

Proof (Lemma 5)

From lemma 3 it follows that there are $N_{elem}(0) = (4^{g+1} - 1)/3$ different
elements. C(s) is maximal if s has the form $\langle 3^g \rangle$. In this case, C(s)
evaluates to the term

$$\sum_{0 \le i < g} \left( 3 \cdot \frac{4^{g-i} - 1}{3} + 1 \right) = \sum_{0 \le i < g} 4^{g-i} = \left[ \sum_{0 \le i \le g} 4^i \right] - 1 = \frac{4^{g+1} - 1}{3} - 1$$

The coding of the empty string is $C(\langle\rangle) = 0$. Therefore, the $N_{elem}(0)$
different strings are mapped exactly to the interval $[0, N_{elem}(0) - 1]$.

□ 

From lemma 5, it also follows that C is a surjective mapping onto the set
$\{0, \ldots, N_{elem}(0) - 1\}$. We note without a formal proof that C is also injective.
The quadrant sequence can be reconstructed from the coding in an efficient way, but this
is not necessary for our purposes.
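
Although the reconstruction is not needed for query processing, it can be sketched briefly. The decoding below is one possible way to invert the sequence code and is not taken from the paper; the names `sequence_code` and `decode` are hypothetical.

```python
from itertools import product


def sequence_code(seq, g):
    """Sequence code: sum of q_i * (4**(g - i) - 1) / 3 + 1 over all digits."""
    return sum(q * (4 ** (g - i) - 1) // 3 + 1 for i, q in enumerate(seq))


def decode(code, g):
    """Reconstruct the quadrant sequence from its code (greedy, digit by digit)."""
    seq = []
    i = 0
    while code > 0:
        code -= 1                              # the "+1" of position i
        step = (4 ** (g - i) - 1) // 3         # codes covered by one sub-quadrant
        q = code // step
        seq.append(q)
        code -= q * step
        i += 1
    return seq


g = 3
roundtrip_ok = all(
    decode(sequence_code(s, g), g) == list(s)
    for l in range(g + 1)
    for s in product(range(4), repeat=l)
)
print(roundtrip_ok)
```

The loop needs at most g iterations, because the remaining code is 0 once a sequence of full length g has been reconstructed.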

4. Query Processing 

4.1 Insert and Delete 

We know from lemma 1 that the quadrant sequence has the length
$l_1 = \lfloor \log_{0.5}(\max\{w, h\}) \rfloor$ or $l_1 + 1$. We can decide this by the
following predicate which tests whether the object is intersected by one or two grid lines
of the grid at level $l_1 + 1$ ($x_l$ denotes the lower boundary of the object in x
direction; the same criterion must be applied for $y_l$):

$$\left\lfloor \frac{x_l}{0.5^{l_1+1}} \right\rfloor \cdot 0.5^{l_1+1} + 2 \cdot 0.5^{l_1+1} < x_l + w$$

If the predicate holds in the x or in the y direction, the object does not fit into an
enlarged element of length $l_1 + 1$ and the length is $l_1$; otherwise the length is
$l_1 + 1$.

Once the length l of the quadrant sequence is determined, the corresponding quadrant
sequence s is determined for the lower left corner of the bounding box of the object, as
described in section 2.1. This sequence s is clipped to the length l and coded according
to definition 2. The obtained value is used as a key for storage and management of the
object in a relational index. Our actual algorithm performs the operations of recursive
descent into the quadrants and coding according to definition 2 simultaneously without
explicitly generating the quadrant sequence. The algorithm runs in O(l) time.
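
The insert procedure can be sketched as follows. This is an illustration under assumptions, not the authors' code: the name `xz_value` is hypothetical, the object is given by its bounding box, the quadrant numbering 0 = lower left to 3 = upper right is assumed, and the sequence length is capped at g.

```python
import math


def xz_value(xmin, ymin, xmax, ymax, g):
    """Compute the XZ-value of a bounding box in the unit square."""
    m = max(xmax - xmin, ymax - ymin)
    # l1 = floor(log_0.5(m)), computed by an integer loop and capped at g.
    l1 = 0
    while l1 < g and 0.5 ** (l1 + 1) >= m:
        l1 += 1
    length = l1
    if l1 < g:
        cell = 0.5 ** (l1 + 1)

        def fits(lo, hi):
            # Does [lo, hi] stay inside the enlarged element of level l1 + 1?
            return hi <= math.floor(lo / cell) * cell + 2 * cell

        if fits(xmin, xmax) and fits(ymin, ymax):
            length = l1 + 1
    # Descend towards the lower left corner and accumulate the sequence code.
    code, x0, y0, size = 0, 0.0, 0.0, 1.0
    for i in range(length):
        size /= 2.0
        right, upper = xmin >= x0 + size, ymin >= y0 + size
        q = 2 * upper + right
        code += q * (4 ** (g - i) - 1) // 3 + 1
        x0 += size * right
        y0 += size * upper
    return code


# A small object on both middle lines still receives a long sequence.
print(xz_value(0.49, 0.49, 0.51, 0.51, 4))
```

The resulting integer would be stored in an indexed column next to the geometry reference and the thematic attributes.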

4.2 From Window Queries to Interval Sets 

For query processing, we can proceed in a recursive way similar to the algorithm
presented in section 2.2. We determine which quadrants are intersected by the query. Those
which are not intersected are ignored. If a quadrant is completely contained in the query
window, all elements having the corresponding quadrant sequence as prefix are completely
contained in the query. Therefore, the interval of the corresponding XZ-values is
generated and marked for a later retrieval from the database. If a quadrant is
intersected, the corresponding XZ-value is determined and marked (it is handled as a
one-value interval). Then the algorithm is called recursively for all its sub-quadrants.
Finally, we have marked a set of intervals of XZ-values which are to be retrieved from
the database. This set is translated into an SQL statement which is transferred to the
DBMS.

    FOREACH interval i in list {
        WHILE succ (i).lower - i.upper < maxgap {
            i.upper := succ (i).upper ;
            delete succ (i) ;
        }
    }

Figure 5: The simple algorithm for gap closing.
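
The interval generation for XZ-values can be sketched as below. This reflects one reading of the description and is not the authors' implementation: intersection and containment are tested against the enlarged elements of definition 1, the function name `xz_intervals` is hypothetical, and the quadrant numbering 0 = lower left to 3 = upper right is assumed.

```python
def xz_intervals(window, g):
    """Translate a window (xmin, ymin, xmax, ymax) into XZ-value intervals."""
    qx0, qy0, qx1, qy1 = window
    out = []

    def visit(x0, y0, size, code, depth):
        ex1, ey1 = x0 + 2 * size, y0 + 2 * size   # enlarged element
        if ex1 < qx0 or x0 > qx1 or ey1 < qy0 or y0 > qy1:
            return                                 # not intersected: ignore
        if qx0 <= x0 and ex1 <= qx1 and qy0 <= y0 and ey1 <= qy1:
            # Contained: all elements with this prefix form one interval.
            n_elem = (4 ** (g - depth + 1) - 1) // 3
            out.append((code, code + n_elem - 1))
            return
        out.append((code, code))                   # one-value interval
        if depth < g:
            half = size / 2.0
            step = (4 ** (g - depth) - 1) // 3
            for q in range(4):
                visit(x0 + (q & 1) * half, y0 + (q >> 1) * half,
                      half, code + q * step + 1, depth + 1)

    visit(0.0, 0.0, 1.0, 0, 0)
    return out


print(xz_intervals((0.0, 0.0, 1.0, 1.0), 1))
```

Neighboring intervals are left unmerged here, which is why a fine grid produces a long interval list.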

Problems arise if the resolution of the basic grid is chosen too fine, because in this
case, typically many elements are partially intersected. In the average case, the number
of generated intervals is in the order of $O(2^g)$. It is very costly to transfer and
compile such a complex query. To alleviate this problem, one could apply the simple
densification algorithm depicted in figure 5 to the set of intervals. As only small gaps
between subsequent intervals are closed, it is unlikely that this densification of
intervals causes additional disk accesses in query processing. The densification after
the interval generation optimizes query compilation cost and related cost factors, but
does not change the general complexity of the interval generation. For this purpose, an
algorithm must be devised which generates the intervals directly in a way that closes
the gaps. This algorithm is described in the subsequent section.
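
The gap-closing step of figure 5 translates directly into Python. The sketch assumes inclusive integer intervals given as (lower, upper) pairs; the function name `close_gaps` is hypothetical, while `maxgap` follows the name used in the figure.

```python
def close_gaps(intervals, maxgap):
    """Merge subsequent intervals whose gap is smaller than maxgap."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo - merged[-1][1] < maxgap:
            # Gap to the predecessor is small: extend the predecessor.
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(iv) for iv in merged]


print(close_gaps([(0, 3), (5, 6), (20, 25)], maxgap=3))
```

Apart from the initial sort, this is a single pass over the list, so it reduces the size of the generated SQL statement but not the cost of producing the intervals in the first place.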

4.3 An Efficient Algorithm for the Interval Generation 

This algorithm allows the user to specify the number $n_{int}$ of intervals which are
generated. We exploit a general property of XZ-ordering and Z-ordering, which we note
here without a formal proof: Let us consider the interval sets which arise if we restrict
our interval generation to a certain length l ($0 \le l \le g$) of the corresponding
quadrant sequences. If l is increased, it is possible that more intervals are generated.
The factor by which the number of intervals in the interval set grows when increasing the
length l by one is bounded by 4. All intervals of the larger interval set are contained
in an interval of the smaller set. The additional gaps between the intervals which appear
when l is increased by 1 can only be smaller than all gaps which are visible before the
increase. If we descended the recursion tree of the algorithm in section 2.2 in a
breadth-first fashion and stopped as soon as we had found 4 times as many intervals as
demanded ($4 \cdot n_{int}$), we would know that we can densify this set to a set of
$n_{int}$ intervals with the largest possible gaps between them.

Instead of the breadth-first traversal, we have implemented the following algorithm:
In a first phase, depicted in figure 6, the algorithm performs a depth-first traversal for
determining the number of intervals in each recursion level. The key point is that the
recursive descent is avoided if the corresponding level has reached a number of $n_{int}$
intervals, because we are only interested in the first level having more than $n_{int}$
intervals. In each level, the algorithm measures the number of transitions in the
XZ-order, where the extended elements of the space-filling curve cross the query region.
As the algorithm is called at most $4 \cdot n_{int}$ times in each level, the complexity
of the algorithm is bounded by $O(n_{int} \cdot g)$.

    VAR num_ch: ARRAY [0..g] OF INTEGER ;

    PROCEDURE det_num_changes (element, query: REGION; c_num_ch: INTEGER;
                               c_depth: INTEGER; VAR inside: BOOLEAN) {
        IF (NOT intersect (element, query)) {
            IF (inside) {
                inside := FALSE; INCREMENT (num_ch [c_depth]) ;
            }
        }
        ELSE IF (contains (query, element)) {
            IF (NOT inside) {
                inside := TRUE; INCREMENT (num_ch [c_depth]) ;
            }
        }
        ELSE {
            IF (NOT inside) {
                inside := TRUE; INCREMENT (num_ch [c_depth]) ;
            }
            IF (c_depth < g AND c_num_ch + num_ch [c_depth] < n_int) {
                FOREACH subquadrant {
                    det_num_changes (subquadrant, query, c_num_ch + num_ch [c_depth],
                                     c_depth + 1, inside) ;
                }
            }
        }
    }

    FUNCTION suitable_level (query: REGION): INTEGER {
        INITIALIZE num_ch := {0, 0, ..., 0} ;
        det_num_changes (dataspace, query, 0, 0, FALSE) ;
        suitable_level := first i where num_ch [i] >= n_int ;
    }

Figure 6: The Algorithm for Determining the Suitable Recursion Level (“Phase 1”).

In the second phase, we generate the corresponding interval set by a depth-first
traversal which is limited to the recursion depth obtained in phase one. Whenever the
number of intervals becomes greater than $n_{int}$, the two neighboring intervals with the
smallest gap between them are merged. In a third phase, the upper bounds of the intervals
are investigated. It is possible that the upper bounds can be slightly improved (i.e.
decreased) by a deeper descent in the recursion tree. All three phases yield a linear
complexity in g.
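
The merging rule of the second phase can be illustrated in isolation. The sketch below applies it to a complete interval list after the fact, whereas the algorithm described above merges while the traversal is still running; the function name `limit_intervals` is hypothetical and `n_int >= 1` is assumed.

```python
def limit_intervals(intervals, n_int):
    """Reduce a list of inclusive intervals to at most n_int intervals.

    Repeatedly merges the two neighboring intervals with the smallest gap.
    A simple quadratic sketch; a heap of gaps would be faster.
    """
    ivs = [list(iv) for iv in sorted(intervals)]
    while len(ivs) > n_int:
        # Index of the left neighbor of the smallest gap.
        k = min(range(len(ivs) - 1), key=lambda j: ivs[j + 1][0] - ivs[j][1])
        ivs[k][1] = ivs[k + 1][1]
        del ivs[k + 1]
    return [tuple(iv) for iv in ivs]


print(limit_intervals([(0, 1), (3, 4), (10, 12), (13, 13)], 2))
```

Merging the smallest gaps first keeps the largest gaps open, so the amount of data that is read unnecessarily stays small.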

5. Experimental Evaluation

In order to verify our claim that an implementation of spatial index structures does not
only provide advantages from a software engineering point of view but also in terms of
performance, we implemented the XZ-Ordering technique on top of ORACLE 8 and performed a
comprehensive experimental evaluation using data from a GIS application. Our database
contains 324,000 polygons from a map of the European Union. We also generated smaller
data sets from the EU map to investigate the behavior with varying database size. Our
application was implemented in Embedded SQL/C (dynamic SQL level 3) on HP C-160
workstations. The database server and the client application run on separate machines,
connected by a 100 MBit Fast Ethernet.

Figure 7: (l.) Comparison between ORACLE and file-based B+-tree
(r.) Comparison between various Space-Filling Curves.

We also had an implementation on top of a file-system based B+-tree which was used
for comparison purposes. The reason for this test was to show that our technique is
implementable on top of a commercial database system and that the additional overhead
induced by the database system is small. In both cases, we varied the database size from
1,000 to 324,000 and executed window queries with a selectivity of 1% (with respect to
the number of polygons in the query result). All experiments presented in this section
were repeated ten times. The presented results are averages over all trials. The number
of page accesses (cf. figure 7, left diagram) is at most 8% higher in ORACLE than in
the file-based B+-tree implementation. We used a comparable node capacity in both
implementations. A second experiment, depicted on the right side of figure 7, shows that
the choice of a specific space-filling curve has no strong influence on the performance
of our technique. Using the same settings, we tested various curves including Z-Ordering
(Peano), the Hilbert curve and Gray Codes. There was no consistent trend observable which
could suggest the superiority of one of the curves. Therefore, we decided to perform all
subsequent experiments using the Peano/Morton curve, because it is easier to implement.

The purpose of the next series of experiments is to show the superiority of our
approach over competing techniques such as Orenstein’s Z-Ordering. First, we demonstrate
in figure 8 that our technique, in contrast to Z-Ordering, is not subject to a
performance deterioration when the grid resolution is chosen too high. We constructed
several indexes with varying resolution parameter g for both techniques, XZ-Ordering and
Z-Ordering, where we applied the size-bound decomposition strategy resulting in a
redundancy of 4. In this experiment, we stored 81,000 polygons in the database and
retrieved 1% of them in window queries. Z-Ordering has a clear minimum at a resolution of
g = 8 with a satisfying query performance. When the resolution is slightly increased or
decreased, the query performance deteriorates. For instance, if g is set to 10, the
number of page accesses is 80% higher than at the optimum. At the maximum resolution
factor g = 13, the number of page accesses was more than 6.4 times higher than at the
optimum. In contrast, our technique shows a different behavior. If the grid resolution
parameter g is too small, the performance is similar to the performance of Z-ordering. A
grid that is too coarse obviously leads to a bad index selectivity, because many objects
are mapped to the same Z-values or XZ-values, respectively. Both techniques have the same
point of optimum. Beyond this point, XZ-Ordering yields a constant number of page
accesses. Therefore, it is possible to avoid the optimization of this parameter, which is
difficult and depends on dynamically changing information such as the fractal dimension
of the stored objects and their number. This trend is maintained and even intensified
if the selectivity of the query is increased, as depicted on the right side of figure 8.
Here, the number of polygons was 324,000 and the resolution was fixed to g = 10.

Figure 8: (l.) Z-Ordering is sensitive to the resolution
(r.) Performance comparison with varying selectivity.
(Horizontal axes: resolution factor g (left), selectivity in % (right); curves: Z-Ordering, XZ-Ordering.)




Figure 9: (l.) Comparison with the One-Value-Approximation (curves: One-Value-Approximation, XZ-Ordering). 
(r.) Comparison with the Sequential Scan (curves: Sequential Scan, XZ-Ordering). 
Axis label: Number of Polygons. 

In a further series of experiments, we compared our technique with the 
One-Value-Approximation (cf. section 2.4) and with the sequential scan of the data set. The 
remaining test parameters correspond to those of the previous experiments. Both competing 
techniques are clearly outperformed, as depicted in figure 9. The One-Value-Approximation 
yields up to 90% more page accesses. The sequential scan needs up to 326% as much 
processing time as XZ-Ordering. 

In a last experiment, we determined the influence of the query processing algorithm 
presented in section 4.3. We again varied the resolution parameter g from 3 to 11 and 
generated interval sets of 20 intervals according to a window query with side length 0.03 
(the unit square is the data space). We measured the CPU time required for the generation 
of the intervals and of the corresponding string for the dynamic SQL statement. We 
compared our algorithm with the simple algorithm (cf. section 2.2) extended by the gap 
closing algorithm (cf. figure 5). The results are presented in figure 10. The simple 
algorithm has an exponential complexity with respect to the resolution, whereas the 
improved algorithm is linear. For the finest resolution (g = 11), the improved algorithm 
is faster than the simple algorithm by a factor of 286. 

6. Conclusion 

In this paper, we have proposed XZ-Ordering, a new technique for a one-dimensional 
embedding of extended spatial objects such as rectangles or polygons. In contrast to 
previous approaches, which require the optimization of the resolution parameter in order 
to become efficient, our technique is insensitive to a fine resolution of the underlying 
grid. Therefore, the resolution parameter is restricted only by hardware constants 
(the number of bits for an integer value) and can be chosen as fine as possible. This 
robustness is achieved by applying three basic concepts. Object decomposition is 
avoided by the concept of overlapping elements. A sophisticated order-preserving coding into 
the integer domain facilitates the management of the search keys for the DBMS. A new 
query processing algorithm is also insensitive to the grid resolution. 

The superiority of our approach was shown both theoretically and practically. 
We implemented the XZ-Ordering technique on top of the relational database system 
ORACLE-8. Our technique outperformed competing techniques based on space-filling 
curves as well as standard query processing techniques. 
Figure 10: Improved interval generation. 

Abstract. A lot of recent work has studied strategies related to bulk 
loading of large data sets into multidimensional index structures. In this 
paper, we address the problem of bulk insertions into existing index 
structures, with particular focus on R-trees, which are an important class of 
index structures used widely in commercial database systems. We 
propose a new technique which, as opposed to the current technique of 
inserting data one by one, bulk inserts entire new incoming datasets into 
an active R-tree. This technique, called GBI (for Generalized Bulk 
Insertion), partitions the new datasets into sets of clusters and outliers, 
constructs an R-tree (small tree) from each cluster, identifies and 
prepares suitable locations in the original R-tree (large tree) for insertion, 
and lastly performs the insertions of the small trees and the outliers into 
the large tree in bulk. Our experimental studies demonstrate that GBI 
does especially well (over 200% better than the existing technique) for 
randomly located data as well as for real datasets that contain few 
natural clusters, while also consistently outperforming the alternate technique 
in all other circumstances. 



Index Terms — Bulk-insertion, Bulk-loading, Clustering, R-Tree, Index Struc- 
tures, Query Performance. 

1 Introduction 

1.1 Background and Motivation 

Spatial data can commonly be found in diverse applications including 
cartography, computer-aided design, computer vision, robotics and many others. The 
amount of available spatial data is of ever-increasing magnitude. For 
example, the amount of data generated by satellites is said to be several gigabytes 
per minute. Hence, efficient storage and indexing techniques for such data are 
needed. 

Among the many index structures proposed for spatial data, R-trees remain a 
popular index structure employed by numerous commercial DBMSs 
[Map]. The R-tree structure was initially proposed by Guttman [Gut84], and 
many variations and improvements over the original structure have been suggested 
ever since [BKSS90]. 

Generally, index structures need to be set up by loading data into them before 
they can be utilized for query processing. Initially, the basic insert operation 
proposed in [Gut84] was used to load sets of data into an R-tree, i.e., each 
object was inserted one by one into the index structure. In this paper, we will 
refer to this traditional technique of sequential data loading as the OBO 
(one-by-one) technique. Inserting data sequentially into the R-tree has been 
found to be inefficient when the entire tree needs to be set up. 
Thus, as an improvement, various bulk loading strategies have been proposed 
in recent years [LLG97] [BWS97]. These bulk loading strategies 
were aimed at creating a complete R-tree from scratch. These techniques thus 
assumed that totally new data was being collected (or existing data files were 
set up to be utilized by some application), and thus an index structure had to 
be constructed from scratch for this new dataset. 

A topic of growing interest in spatial databases is how to efficiently manipulate 
existing massive amounts of spatial data, especially the problem of bulk 
insertions of new data into an already existing R-tree. The importance of this 
problem for numerous real-world applications is apparent. For example, new data 
obtained by satellites needs to be loaded into existing index structures. Active 
applications using the spatial data should continue functioning while being 
minimally impacted by the insertion of new data, and in addition should be given the 
opportunity to make use of the new data as quickly as possible. The construction 
of a new index structure from scratch each time, to contain both the old and 
the new data, is not likely to scale well with an increasing size of the existing 
original index structure. Instead, new techniques specially tuned to this problem 
are needed. 

1.2 The Proposed Bulk-Insertion Approach 

In our earlier work [CCR98a], we focussed on the problem of bulk insertion when 
the incoming dataset was skewed to a certain subregion of the original data. By 
skewed, we mean that the dataset to be inserted is localized to some portion 
of the region covered by the R-tree instead of being spread out over the whole 
region. In this paper, we extend that work and provide a solution that deals 
with the general problem of bulk insertion of datasets of any nature instead 
of just skewed datasets. Both works look into the problem of bulk insertion 
of new datasets into an existing active R-tree. By active, we mean an R-tree 
which already contains (large) datasets, may have been used for some amount 
of time, and which has currently active applications that forbid the possibility 
of scheduling a long down-time for the index structure construction process. 
However, in contrast to the prior STLT (Small-Tree-Large-Tree) technique 
[CCR98a], we now propose a new technique that is designed to handle 
bulk-insertion cases for different characteristics of the incoming datasets, both in 
terms of the size of the new versus existing datasets and in terms of completely 
localized, partially skewed, somewhat clustered, or even completely uniform datasets. 
This technique, called GBI (for Generalized Bulk Insertion), partitions the new 
dataset into a set of clusters, constructs R-trees from the clusters (small trees), 
identifies and prepares suitable locations in the original R-tree (large tree) for 
insertions, and lastly performs the insertions of the small trees into the large 
tree. 

Extensive experiments are conducted both with synthesized datasets and 
with real-world datasets from the Sequoia 2000 storage benchmark to test 
the applicability of our new technique. The results for both are comparable, thus 
indicating the appropriateness of the synthesized data for our tests. In our 
experiments, we find that GBI does especially well (in some cases even more than 
200% better than the existing technique) for non-skewed large datasets as well as 
for large ratios of large tree to small tree data insertion sizes, while consistently 
outperforming the alternate technique in practically all other circumstances. Our 
experimental results also indicate that GBI not only reduces the time taken 
for loading the data, but also provides reasonable query performance on the 
resultant tree. The quality of the resulting tree constructed by GBI in terms of 
query performance is comparable to that created by the traditional tree insertion 
approach. 

In summary, the contributions of this work include: 

1. Design a general solution approach, GBI, to address the problem of bulk 
insertions of spatial data into existing index structures, in contrast to recent 
work on bulk loading data from scratch [LLE97] [LLG97] [BWS97] 
and our previous work that deals with skewed datasets only [CCR98a]. 

2. Select and modify MacQueen’s k-means clustering algorithm to achieve the 
clustering desired by GBI. 

3. Implement GBI and the conventional insertion technique in a UNIX 
environment using C++ in order to establish a uniform testbed for evaluating 
and comparing them. 

4. Conduct experimental studies using both real-world and synthetic 
datasets. These experiments demonstrate that GBI is a winner, both in 
terms of substantial cost savings in insertion times and in keeping 
retrieval costs down. 

The paper is organized as follows. Section 2 reviews related work. Section 3 
defines the bulk insertion problem. Section 4 discusses our solution approach. 
Performance results for tree loading and query handling are presented in 
Section 5. This is followed by conclusions in Section 6. 

2 Related Work 

The general topic of bulk loading data into an initially empty structure has been 
the focus of much recent work. Two distinct categories of bulk loading algorithms 
have been proposed. 

The first class of algorithms involves the bottom-up construction of the 
R-tree. Kamel and Faloutsos [KF93] first order the data using the Hilbert sorting 
technique and then build the R-tree. Leutenegger et al. [LLE97] proposed the STR 
(Sort-Tile-Recursive) technique in which a k-dimensional set of rectangles is 
sorted on an attribute and then divided into slabs. Both techniques are concerned 
with bulk loading and not with bulk insertion. 
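
The slab idea behind STR can be illustrated with a minimal Python sketch of the leaf-level packing step for the two-dimensional case, as the technique is commonly described: sort by x-center, cut into vertical slabs, sort each slab by y-center, and fill leaves in runs. The names `center` and `str_pack_leaves` and the tuple layout (xmin, ymin, xmax, ymax) are assumptions for illustration; upper tree levels, which would be built by repeating the step on the leaf bounding rectangles, are omitted.

```python
import math

def center(rect):
    """Center of a rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)

def str_pack_leaves(rects, capacity):
    """Group rectangles into leaf nodes holding at most `capacity` entries."""
    if capacity < 1:
        raise ValueError("capacity must be positive")
    if not rects:
        return []
    pages = math.ceil(len(rects) / capacity)   # number of leaf nodes needed
    slabs = math.ceil(math.sqrt(pages))        # number of vertical slabs
    per_slab = slabs * capacity                # rectangles assigned to one slab
    by_x = sorted(rects, key=lambda r: center(r)[0])
    leaves = []
    for i in range(0, len(by_x), per_slab):
        # Within a slab, order by y-center and cut into runs of `capacity`.
        slab = sorted(by_x[i:i + per_slab], key=lambda r: center(r)[1])
        for j in range(0, len(slab), capacity):
            leaves.append(slab[j:j + capacity])
    return leaves
```

Note that this builds a tree from scratch from a complete dataset, which is exactly the bulk-loading setting and not the bulk-insertion setting addressed by GBI.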

The other class of algorithms focuses on a top-down approach to building the 
R-tree. Bercken et al. [BWS97] adapt a strategy using a memory-based temporary 
structure called the buffer-tree. This technique is not likely to be very applicable 
when inserting new data over time, unless the temporary structure were 
continuously maintained along with the actual R-tree. This would be expensive, 
as buffer pages would be wasted. Arge et al. [AHVV99] presented a new buffer 
strategy for performing bulk operations on dynamic R-trees. Their method uses 
the buffer tree idea, which takes advantage of the available main memory and 
the page size of the operating system, and their buffering algorithms can be 
combined with any of the existing index implementations. However, their bulk 
insertion strategy is conceptually identical to the repeated insertion algorithm, 
while we present in this paper a conceptually different bulk insertion strategy 
that can potentially be combined with their buffering algorithms. 

[RKR97] discuss bulk incremental updates in the data cube. A portion of 
their work deals with bulk insertions of data which is collected over a period 
of time into R-trees. Their approach uses the sort-merge-pack strategy in which 
the incoming data is first sorted, then merged with the existing data from the 
R-tree, and then a new R-tree is built from scratch. The strategy thus comes down 
to eventually loading the tree from scratch, whereas our approach avoids 
this step, which is prohibitively expensive for large existing trees. [Moi93] suggests 
batching of data and sorting it prior to insertion. However, the sorting phase is expensive 
and requires the data to be collected beforehand, while our algorithm tries to 
avoid the sorting phase. Bulk updates and bulk loading have been studied for 
various structures and in various scenarios [Che97] [LRS93] [LN97]. These 
techniques are typically specific to the structure in question and are not directly 
applicable to our problem. 

Ciaccia et al. proposed methods of bulk loading the M-tree, which is 
possibly the work closest to ours. The proposed bulk loading algorithm performs 
a clustering of data objects into a number of sets, obtains sub-trees from the sets, 
and then performs reorganization to obtain the final M-tree. Our initial STLT 
algorithm is similar to the proposed algorithm in that STLT also constructs trees 
from subsets of data objects. However, in STLT, the choice of an appropriate 
location removes the need for reorganization of the tree in order to re-establish 
the balance of the tree. Again, the problem handled there was that of bulk loading 
data and not of bulk insertion, which is the focus of this paper. Heuristics for the sizes 
of subtrees to insert were not developed there, as is done in our work [CCR98c]. 

Clustering algorithms have long been the focus of extensive study. 
Innumerable clustering algorithms with varying characteristics have been developed, 
such as [DO74] [Spa82] [Rom84]. [Spa82] classifies clustering algorithms 
into the two classes of hierarchical and non-hierarchical algorithms. We do not 
attempt to improve any of the clustering algorithms nor suggest a new one. 
Instead, we simply select one of the algorithms based on our needs and 
apply it in our GBI solution. 



3 Four Issues of Bulk Insertion 

There are two conflicting goals for our proposed technique of bulk inserting data 
into an R-tree, the first being that the quality of the structure should be as good 
as possible and the second being that the time to load the new data into the R- 
tree should be minimized. This observation raises several issues to be addressed 
by our work: 

— First, what is an effective strategy for inserting sets of data in bigger chunks 
rather than one-by-one so as to minimize down time for applications that 
use the R-tree structure? 

— Second, which characteristics should these sets of to-be-inserted data objects 
(insertion sets) ideally possess so that the bulk-insertion strategy works most 
efficiently (in terms of both insertion time as well as resulting tree quality)? 

— Third, how to group incoming possibly continuous streams of spatial objects 
into insertion sets so that each of them possesses desirable characteristics? 

— Lastly, is it possible to design a framework which allows multiple solutions 
which have different balances between data loading times and the resultant 
tree quality? 

For the first issue, we propose a new bulk-insertion strategy called the 
‘Generalized Bulk Insertion’ (GBI) algorithm (see Section 4). GBI not only 
achieves efficient insertion times, but also has minimal impact on existing 
applications (1) by requiring no down time of the existing index, since it avoids 
building a new R-tree from scratch, and (2) by locking as few portions as possible 
of the R-tree for as short a time as possible (often only one single index node). 

For the second issue, we conduct an extensive study that identifies numerous 
possible characteristics of the to-be-inserted data set in order to determine their 
impact on the insertion time as well as final tree quality (see Section 5). 
Characteristics of particular interest are (1) the number of objects in one insertion 
set (size of insertion set versus size of existing tree) and (2) the spatial distribution 
of the new objects (skewness). 

For the third issue, we could simply chop input streams into equally sized sets 
or, as we do in our work, employ a suitable clustering technique that takes data 
distributions into account (see Section 4.4). A new dataset is fed into a clustering 
tool that allows tuning of parameters to control the desired compactness and 
density of the clusters to be generated. Then clusters as well as outliers are 
ready for the next bulk insert phase, using STLT and OBO, respectively. 

For the fourth issue, we provide a solution framework representing a 
mechanism for realizing multiple strategies for bulk inserting the new data. Its 
parameters allow solutions which range from insertion of all the data in one shot 
to the insertion of data items one at a time. The middle portion of this range, 
where data is inserted in multiple sets, provides a compromise between loading 
time and tree quality. We provide a set of heuristics for selecting among the 
strategies for insertion [CCR98c]. 

4 GBI: A Generalized Bulk Insertion Strategy 

4.1 The GBI Framework 

As depicted in Figure 1, the ‘Input Feed’ module of GBI takes all the data 
to be inserted into the existing tree and gives it to the ‘Input Feed Analyzer’. 
Analysis of the input data is done by the latter module, which passes the result 
of its analysis (number of incoming data, area of coverage, etc.) to the ‘Cluster 
Detector’. The latter identifies clusters of suitable dimensions and accordingly 
separates the data into different clusters and a set of outliers. The generated 
clusters are used to construct a series of separate R-trees (small trees) by the 
‘Small Tree Generator’ module. Note that this step can be done in parallel in 
order to improve performance. The ‘Strategy Selector’ module determines what 
strategy of insertion should be adopted, i.e., whether to gather all the data into 
one large single set and insert it in one shot or to group the data into multiple 
sets. This choice reflects the tradeoff between fast insertion of data and higher 
quality of the resultant tree. The small trees constructed are inserted into the 
existing large tree by the ‘STLT Insertion’ module. The outliers are inserted into 
the existing R-tree using the traditional (OBO) insert function. 

The main steps of our Generalized Bulk Insertion (GBI) framework are the 
following: 

1. Using our proposed heuristics, based on incoming data size and its area of 
coverage, determine values of the clustering modules’ parameters. 

2. Execute the clustering algorithm on the new dataset and obtain a set of 
clusters and outliers. 

3. Build an R-tree (small tree) from one of the generated clusters. 

4. Find a suitable position in the original tree (big tree) for insertion of the 
newly built small tree. 

5. Handle unavailability of entry slot space in the large tree using different heuristic 
techniques. 

6. Insert the small tree into the identified location (or into the location created as a 
result of Step 5). 

7. Repeat steps 3 through 6 for each of the generated clusters. 

8. Insert the outliers (also generated from step 2) using a traditional insertion 
algorithm. 

A detailed explanation of the key steps of the GBI framework follows. 

Fig. 1. GBI Framework 



4.2 The GBI Terminology and Assumptions 



For the following, we assume the terminology and notation below: 

NumClusters = number of clusters generated by the cluster generator 
ST = newly built small tree 
LT = original R-tree before insertion 
ST_Root = root node of the small tree 
LT_Root = root node of the large tree 
IndexNode = an index node of the large tree 
M = maximum number of slot entries in a node 
DiffHeight = difference in heights between large and small trees 

An explanation of each of the parameters employed in the clustering 
algorithm is as follows: 

k : Initial number of clusters. This may increase or decrease depending on the 
number of clusters actually present and the order in which input data 
arrives or is considered. The final number of generated clusters is mainly 
dependent on the input dataset. 

C : Coarsening value which determines how close one cluster can be to another. 
Any two clusters whose centroids are at a distance less than this value will 
be merged. 

R : Refining value which determines how close a data point should be to a cluster 
to make the cluster a candidate for insertion of the data point. Any data point to 
be inserted that is at a distance greater than R from all the clusters will be 
inserted in its own separate cluster. If there is more than one cluster at a 
distance less than R, then the data point will be inserted in the closest cluster. 
The cluster where the data point is inserted has its centroid recomputed and the 
possibility of cluster merges is examined. 

fmin : Minimum flush value which determines how large a cluster should be 
(in terms of the number of data elements present) in order to consider it a 
candidate for a small tree. If the number of elements in a cluster is less than 
fmin, then the data elements are inserted into the outlier list. 

fmax : Maximum flush value which determines the maximum size allowed for 
a cluster. 



ALGORITHM GBI(LT_Root) 
{ 
    SetClusteringParameters(f, C, R, k);            // Initial tuning values 
    while (a pause longer than a set time period occurs in the input data) { 
        NumClusters = InvokeClusteringAlgo(DatasetFromInputFeed); 
        repeat (for NumClusters clusters) { 
            ST_Root = BuildSmallTree(next ClusterFile); 
            DiffHeight = LTHeight - STHeight;       // Insertion level 
            if (DiffHeight < 1)                     // Large tree <= small tree 
                adopt OBO insertion;                // not suitable for GBI bulk insertion 
            else                                    // Insert small tree 
                InsertSTintoLT(LT_Root, ST_Root, DiffHeight); 
        } 
        InsertUsingOBO(OutliersSet); 
    } 
} 



Fig. 2. The GBI Framework: Bulk Insertion of New Data into Large Tree 



4.3 GBI Framework Description 

First, the entire new dataset is fed to the clustering tool, which then breaks up the 
dataset into appropriate clusters and a set of outliers. The generated clusters are 
controlled by a few parameters which determine the number of data items in a 
cluster and also the compactness or density of the generated clusters. The GBI 
algorithm given in Figure 2 and visually depicted in Figure 3 first identifies 
clusters and outliers from the given input dataset and then builds a small tree 
(denoted by ST) from each of the clusters, as indicated by the function 
BuildSmallTree() invoked in the algorithm. Next, considering one small tree at a time, 
we compute the difference in heights of the large tree and small tree, as this 
determines how many levels we need to go down in the large tree in order 
to locate the appropriate place for insertion of the small tree. If the new data 
is larger than the existing data, then the proposed technique is not meaningful, 
and we instead simply apply one of the bulk-loading strategies to build a new 
tree containing both old and new data from scratch. Otherwise, we invoke the 
InsertSTintoLT() function to insert ST into an appropriate IndexNode in the 
large tree. The previous step is repeated for each of the small trees. Finally, once 
all clusters are exhausted, the outliers are inserted into the large tree one-by-one. 



Fig. 3. (diagram label: Generated Small Trees) 




4.4 GBI Clustering Module 

For the clustering, we use a variant of MacQueen’s k-means method 
with a suitable extension. This algorithm is chosen because it can be easily 
modified to allow for clusters of fixed maximum or minimum sizes. It can also 
be adapted to continuously incoming data. Figure 3 shows how clusters 
and outliers are formed from a given set of input data elements and how some 
of the formed clusters can potentially be merged (based on the values of the 
parameters of the clustering algorithm). 

1. Select proper values for the parameters: number of clusters k, coarsening 
value C, refining value R, and flush values fmin and fmax. 

2. For the input, let each of the first k data elements be a cluster of size one. 

3. Enforce the minimal distance C between clusters: merge the clusters whose 
centroids are at a distance less than C until all the clusters are at a 
distance of at least C from one another. 

4. Take the next data element and determine the cluster closest to it. 

5. If the data element is at a distance of less than R from its closest cluster, 
insert it into that cluster, recompute the cluster's centroid, and merge clusters 
whose distance has become less than C. If the closest cluster is at a distance 
greater than R, take the data element as a new cluster of one member. 

6. Repeat Steps 4 and 5 for the remaining data given by the Input Feed module. 

7. Take each of the cluster centroids as fixed seed points and reallocate each 
data element to its nearest seed point. 

8. For each cluster, if the number of data items it contains is at least fmin, 
then consider it as a cluster; otherwise insert its data items into the outlier list. 
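
The steps above can be sketched in Python as follows. This is a minimal illustration for two-dimensional points under several assumptions, not the implementation used in the experiments: the names `centroid`, `merge_close` and `cluster_points` are hypothetical, a point at a distance of exactly R is treated as a new cluster, and the maximum flush value fmax is omitted because the steps do not state how it is applied.

```python
import math

def centroid(points):
    """Arithmetic mean of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def merge_close(clusters, C):
    """Step 3: merge clusters until all centroids are at least C apart."""
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if math.dist(centroid(clusters[i]), centroid(clusters[j])) < C:
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break   # centroids changed, so restart the pairwise scan
    return clusters

def cluster_points(points, k, C, R, f_min):
    """Return (clusters, outliers) following steps 2 to 8."""
    clusters = merge_close([[p] for p in points[:k]], C)      # steps 2-3
    for p in points[k:]:                                      # steps 4-6
        nearest = min(clusters, default=None,
                      key=lambda c: math.dist(p, centroid(c)))
        if nearest is not None and math.dist(p, centroid(nearest)) < R:
            nearest.append(p)
            merge_close(clusters, C)
        else:
            clusters.append([p])     # assumption: distance == R also lands here
    seeds = [centroid(c) for c in clusters]                   # step 7
    final = [[] for _ in seeds]
    for p in points:
        idx = min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i]))
        final[idx].append(p)
    kept = [c for c in final if len(c) >= f_min]              # step 8
    outliers = [p for c in final if len(c) < f_min for p in c]
    return kept, outliers
```

In the GBI framework, each returned cluster would then be turned into a small tree, while the outliers would be inserted one by one.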

4.5 GBI - STLT Module 

A detailed description of the STLT (small-tree-large-tree) strategy, including 
precise algorithms for the module to insert one small tree into an existing large 
tree, is given in [CCR98a]; here we give only a brief summary of its basic ideas. 
Let the height of the newly built small R-tree be h_r and the height of 
the original R-tree be H_R. We consider the root rectangle of the small R-tree 
(the enclosing rectangle of all new data rectangles) as a data rectangle. In other 
words, we use the standard insert operation to find a suitable place to insert the 
newly built R-tree into the existing R-tree (referred to as the large tree) as if it 
were an individual data item. We try to insert it into the level l = H_R - h_r of 
the original R-tree. Our goal here is to ensure that the bottom level of the small 
R-tree is on the same level as that of the original R-tree, as seen in Figure 4. 
This ensures that the resultant tree remains balanced, which is a 
fundamental requirement for the structure to be an R-tree. 
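
The level computation can be made concrete with a small Python sketch. It assumes that heights are counted in levels with the root at level 1, that nodes are plain dictionaries with hypothetical keys "mbr" and "children", and that the standard insert operation chooses a subtree by least area enlargement (ties broken by smaller area), as in Guttman's original R-tree. Handling of a full target node is left out.

```python
def area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def choose_index_node(root, large_height, small_height, small_mbr):
    """Find the node at level l = H_R - h_r that should receive the small tree."""
    diff_height = large_height - small_height
    if diff_height < 1:
        raise ValueError("small tree must be lower than the large tree")
    node = root                       # the root is at level 1
    for _ in range(diff_height - 1):  # descend to level l
        node = min(
            node["children"],
            key=lambda c: (area(union(c["mbr"], small_mbr)) - area(c["mbr"]),
                           area(c["mbr"])),
        )
    return node
```

For example, with a large tree of height 3 and a small tree of height 1, the function descends one level and returns a level-2 node; the small tree's single node then becomes a level-3 child, so its data entries sit at the same level as the leaves of the large tree.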




Fig. 4. Insertion of One Small Tree into the Large Tree 

5 Experimental Evaluation of the GBI Framework 

5.1 Experimental Setup 

Testbed Environment: Our performance studies are conducted on a testbed 
on a SUN Sparc-20 workstation running the UNIX operating system. The testbed 
includes the original R-tree with quadratic splitting [Gut84], an I/O buffer 
manager, the modified MacQueen’s k-means clustering module, the input feed analyzer, 
the strategy selector module and other supporting data structures and procedures. 
Test Data: The test data comprised both real datasets and synthetic datasets. 
Most of our tests are based on synthetically generated datasets in order 
to verify the usefulness of GBI under different extreme settings. The synthetic 
data was generated with varying parameters to control the distribution and the 
nature of the objects. The real data from the TIGER/Line files distributed by 
the US Census Bureau consists of a dataset of streets (131,461 objects) and a 
dataset of rivers and railway tracks (128,971 objects) from an area in California. 
Test Types: We carried out two major classes of tests. The first type of 
experiments was conducted to compare the I/O insertion costs of GBI with those of OBO 
for different parameters. The tests were designed to evaluate the performance of 
GBI and compare it to that of OBO in terms of I/O costs for different parameters, 
and to determine the usefulness and limitations of GBI. The second type of 
experiments focussed on evaluating the quality of the resultant trees formed by GBI 
or OBO style of insertions. The tests consisted of posing queries on the resultant 
trees and measuring the number of nodes visited to answer the queries and the I/O 
cost incurred in answering the queries. 
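
As an illustration of the second measure, the following Python sketch counts the nodes visited by a window query on an R-tree-like structure. The dictionary layout (keys "leaf", "entries", "mbr", "child") and the function names are assumptions for illustration and do not reflect the testbed code; the effect of the I/O buffer manager on the actual I/O cost is not modeled.

```python
def intersects(a, b):
    """True if rectangles a and b (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def window_query(node, window):
    """Return (matching data rectangles, number of nodes visited)."""
    visited = 1                          # count the current node
    hits = []
    for entry in node["entries"]:
        if not intersects(entry["mbr"], window):
            continue                     # prune subtrees outside the window
        if node["leaf"]:
            hits.append(entry["mbr"])
        else:
            sub_hits, sub_visited = window_query(entry["child"], window)
            hits.extend(sub_hits)
            visited += sub_visited
    return hits, visited
```

A tree whose index rectangles overlap less forces fewer subtree descents for the same window, which is why the visited-node count serves as a proxy for tree quality.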

5.2 Experiments Measuring Insertion Cost and Query Cost for 
Different Data Area Ratios 

This experiment is used to evaluate the performance of GBI as compared to 
OBO, when the area of the new dataset is varied from a small percentage of 
the large tree until it equals the area covered by the original tree. A set of 5000 
data elements is inserted and the insertion times are measured. A set of 50000 
random queries is asked on each of the trees generated after OBO insertion and 
after GBI insertion to evaluate the query performance. The number of elements 
in the large tree is 100000. 

As shown in Figure 5, GBI wins over OBO for most ‘C’ and ‘R’ values. For 
larger ‘C’ and ‘R’ values, the improvement in insertion times is maximal. This is 
because, for larger values, many clusters are formed and there are few outliers. 
In such cases, the insertion cost is the sum of the costs of building the small trees and 
then inserting them one by one, which is less than the cost of inserting the data elements 
one by one. 

An important result is that as the new data becomes less and less skewed, 
the savings in terms of insertion times become more significant. This shows that 
the GBI algorithm is useful for non-skewed random data. The reason 
for this improvement is, to some degree, the fact that for less skewed data OBO 
is not able to exploit the locality of pages in the buffer, as most elements belong 
to different pages. Hence, the I/O cost of insertion for OBO increases as the new 
dataset becomes less localized. 

Fig. 5. I/O Insertion Cost for Different Area Ratio of Original Tree Dataset to 
New Dataset. 

Fig. 6. Query Cost for Different Area Ratio of Original Tree Dataset to New 
Dataset. 



On the other hand, Figure 6 shows that the query performance of the GBI
algorithm is comparable to the OBO query performance for smaller ‘C’ and ‘R’
values. This is because the clusters formed are much tighter if the values of
‘C’ and ‘R’ are lower. The figure also shows that GBI improves on the STLT
query performance while yielding significant savings in the insertion
time. This is done by ensuring that the data to be inserted into the original tree
is clustered and thus each small tree does not cover too large an area of the large
tree region.

5.3 Experiments Measuring Insertion Cost for Different Data Sizes 

In this experiment, we keep the ratio of the large tree data size to the new data 
size fixed (at a value of 50) and instead vary absolute data sizes by increasing 
both small and large tree sizes at the same percentage. The area covered by both 
the original tree and the new dataset is the same. 




Fig. 7. I/O Insertion Cost for Different Ratios of Original Tree Size to the New 
Dataset Size 



As seen from Figure 7, as the relative sizes of the original tree and the new
dataset increase, the insertion cost for OBO increases fairly rapidly whereas 
the costs for GBI increase less rapidly. Thus, GBI yields more improvement for 
relatively larger sizes of original tree and new dataset. This experiment shows 
that GBI is scalable and is not constrained to small sized datasets. 

5.4 Experiments on Determining Effect of ‘C’ and ‘R’ Values 

In this experiment, we analyze the effect of varying ‘C’ and ‘R’ values on the 
insertion times and the resulting tree quality in terms of query performance. We 
insert a set of 5000 data elements randomly located over the region of the large 
tree and measure the insertion costs and query costs for 50000 random queries. 
By randomly located, we mean that the data elements are randomly spread out 
all over the area enclosed by the original tree elements which is 30000 by 30000. 

As can be seen from Figure 8, as the ‘C’ and ‘R’ values increase, the insertion
time decreases. This is because for larger ‘C’ and ‘R’ values, larger clusters
are formed and fewer data elements are inserted individually, which improves
the insertion times tremendously (100-fold improvement for
‘C’ and ‘R’ values greater than 3000). For low ‘C’ and ‘R’ values, fewer
clusters are available for constructing small trees, and most of the data, being
outliers rather than part of a cluster, is inserted using the OBO technique. For
very small values of ‘C’ and ‘R’, the technique yields no small trees and hence
there is no difference between the OBO and the GBI insertion costs.

Fig. 8. I/O Insertion Cost for Different ‘C’ and ‘R’ Values

Fig. 9. I/O Query Cost for Different ‘C’ and ‘R’ Values



The query performance corresponding to the resultant trees from the above
experiment is shown in Figure 9. This shows clearly that GBI yields trees of good
quality when the ‘C’ and ‘R’ values are kept low. This is because low ‘C’ and ‘R’
values yield denser clusters, which keep the query costs low. For higher ‘C’
and ‘R’ values, the clusters formed are less dense and thus the trees generated
using them are of slightly lower quality (in terms of intersecting MBRs), and
thus the query performance of GBI becomes worse. In this experiment, the tree
constructed using GBI incurs at most one extra I/O on average
compared to the one constructed using OBO. Clearly, settings for the ‘C’ and ‘R’
values must be empirically selected to find a trade-off between maximizing tree
retrieval quality and minimizing tree insertion times.



5.5 Experiments on Real Datasets Using Tuned GBI 

Based on the experiments listed thus far and additional ones that can be found 
in our technical report PEEnHc], we have empirically determined values for 
initializing the key parameters of the GBI framework, such as values for C, R, 
flushmin, flushmax, etc. (see Section 4.1 for an explanation of these parameters).
A justification of this tuning of GBI, then called the GBI* heuristic strategy,
can be found in |M9|, while below we give some insight into the performance
achieved by this tuned GBI*. The experiments are done with real data extracted
from the Sequoia 2000 storage benchmark. The experiment uses different sizes
of the large tree and new dataset but the ratio is kept fixed at 50. 




Fig. 10. Comparison of Insertion Costs of GBI* and OBO for Real Datasets



Figure 10 clearly shows that GBI* wins out over OBO in terms of insertion
costs. As the sizes of the datasets increase, the savings in insertion cost increase
as well. Figure 11 displays the query costs for GBI* and OBO, which are very
similar, thus indicating that the GBI* generated tree is of acceptable quality.

Fig. 11. Comparison of Query Costs of GBI* and OBO for Real Datasets

6 Conclusions

In contrast to previous work on bulk loading data, which primarily focussed on
building index structures from scratch, in this paper we tackle the problem of
bulk insertions into existing active index structures. We extend our earlier work
[CCR98], where we addressed the problem for skewed datasets only. While our
current work focuses on the R-tree, we contend that the overall strategy of
group-first and then insert-as-bulk is a general approach and thus could also be
explored for other multi-dimensional index structures.

The proposed framework, called GBI (for Generalized Bulk Insertion), con- 
siders the new dataset as a set of clusters and outliers, constructs from the 
clusters a set of R-trees (small trees), identifies and prepares suitable locations 
in the original R-tree (large tree) for insertion of each of the small trees, and 
lastly bulk-inserts each small tree into the large tree. 

We have reported experimental studies designed to not only compare GBI 
against the conventional technique, but also to evaluate the suitability and lim- 
itations of the GBI framework under different conditions. We have found that 
GBI does especially well (over 200% better than the existing technique) for non-
skewed datasets as well as for large ratios of large tree sizes as compared to sizes
of the new data insertions (see Figure 0, while consistently outperforming the 
alternate technique in all other circumstances. Our experimental results also in- 
dicate that the quality of the resulting tree constructed by GBI in terms of query 
performance is acceptable when compared to the resulting tree created by the 
traditional tree insertion approach. All in all, the GBI bulk insertion strategy has 
thus been found to be a viable and significant optimization over the conventional 
one-by-one insertion approach, gaining both in terms of update costs as well as 
preserving the resulting index access quality. 

Possible future tasks include application of the general approach of GBI to 
other multi-dimensional index structures as well as experimental evaluation of 
alternate approaches against GBI such as from-scratch bulk-loading techniques.

Abstract. Spatiotemporal applications, such as fleet management and
air traffic control, involving continuously moving objects are increasingly 
at the focus of research efforts. The representation of the continuously 
changing positions of the objects is fundamentally important in these ap- 
plications. This paper reports on on-going research in the representation 
of the positions of moving-point objects. More specifically, object posi- 
tions are sampled using the Global Positioning System, and interpolation 
is applied to determine positions in-between the samples. Special atten- 
tion is given in the representation to the quantification of the position 
uncertainty introduced by the sampling technique and the interpolation. 
In addition, the paper considers the use for query processing of the pro- 
posed representation in conjunction with indexing. It is demonstrated 
how queries involving uncertainty may be answered using the standard 
filter-and-refine approach known from spatial query processing. 



1 Introduction 

A relatively new research area, spatiotemporal databases concerns the man- 
agement of objects with spatiotemporal extents, and real-world objects with 
continuously changing spatial extents are attracting substantial attention. The 
variety of applications suggests that there is not just one prototypical type of 
spatiotemporal application. 

Spatiotemporal applications may be distinguished based on the data they 
manage, which may pertain to the past, the present, and the future, or a com- 
bination of these. For example, applications managing past data often conduct 
analyses of movements over time, answering queries such as, “What were the 
movements of the Vikings in the North Sea between year 1000 and year 1200?” 
Applications dealing with present and future data capture the current spatial ex- 
tents of objects in the database and typically make predictions about the future 
extents of the objects. Sample queries include, “What is the position of flight 
SAS 286?” and “Where will flight SAS 286 be in 20 minutes?” Next, a specific 
type of application concerns real-world objects that move continuously and dis- 
regards the spatial extents of the objects, representing instead their positions 
as points. Candidate applications include fleet management, air traffic control, 
military command-and-control systems, and people tracking. This paper focuses
on the representation of the past and present positions of such moving-point
objects.

Fundamental issues in these applications include the acquisition and repre- 
sentation of the movements of objects, including the inherent imprecision in the 
representation. For example, when representing the positions of vehicles based on 
sampling, the sampled positions are inherently imprecise, as are the interpolated 
positions in-between the samples. As a result, the record of the movements of 
objects as stored in the database differs from the actual movement. The impreci- 
sions due to the measurements and caused by the use of sampling are inherently 
quite different. It is highly relevant to understand the nature of these impreci- 
sions because this makes it possible to decide on their relative importance. 

This paper’s contributions are three-fold. First, it offers a proposal for repre- 
senting the positions of moving-point objects in databases. Second, it quantifies 
the imprecisions in the proposed representation. The representation is modular, 
allowing the imprecision to be captured or not, depending on the application 
requirements. Third, the paper illustrates how the representation may be used 
in conjunction with indices to answer queries involving uncertainty. The two- 
step filter-and-refinement process known from spatial query processing is used 
together with error information. 

Past database research has focussed on spatiotemporal applications where 
only the present and future positions of moving-point objects are relevant. In 
the context of applications that predict the movements of objects based on their 
current positions, speeds, and directions, Wolfson et al. address position
update policies and the imprecision involved in the database-representation of
the positions. Next, Moreira et al. present a data model for moving-point
objects that is based on the decomposition of the trajectories of the objects 
into sections. In addition, so-called superset and subset semantics are proposed 
that aim to address uncertainty issues. A maximum error occurs when linearly 
approximating the movement of an object in-between samples, and this error 
is used in the process of query processing. However, this work is not connected 
to any specific application or technological context and thus does not cover the 
ranges of errors and the relationships between different error measures. The 
query processing aspects also do not consider the availability of indices. Güting
et al. present a comprehensive framework of abstract data types for moving
objects. This work, however, does not address representation issues, nor does it 
accommodate uncertainty. 

The outline of the paper is as follows. Section 2 presents an application
scenario and describes a particular technological context for the application,
the Global Positioning System. Section 3 proceeds to describe, quantify, and
relate the measurement and the sampling errors in the context of the application
scenario and also accommodates error information in the representation. This
sets the stage for a proposal for a database representation for moving-point
objects, presented in Section 4. Section 5 considers the utilization of this
representation in query processing using indices. Finally, Section 6 concludes
and offers directions for future research.

2 An Application Scenario — GPS-Based Fleet
Management

This section presents a sample spatiotemporal application scenario, fleet
management, and briefly introduces the Global Positioning System (GPS), the
technology that is assumed to be used for sampling the positions of moving objects.



2.1 Fleet Management 

The optimization of transportation, especially in highly populated areas, is a very 
challenging task that may be supported by an information system. An example 
fleet management project, conducted by the Department of Transportation of
the State of California, Caltrans, aims to design what is termed the Advanced
Transportation System. In this application, vehicles equipped with GPS devices 
transmit their positions to a central computer using either radio communication 
links or cellular phones. At the central site, the data is processed and utilized. 
Example queries occurring in such an application are as follows. 

— Which taxi is closest to customer A? 

— What is optimal taxi distribution over the area (somewhat related to pickups 
per area)? 

— Compute the optimal route for a ride, considering road characteristics such
as the actual and theoretical speed limits, congestion, accidents, etc.

Taking uncertainty into account, more sophisticated queries may be formulated. 

— Which taxis were, with a 50% probability, within 100 meters of the Ritz 
hotel at 14.20 on April 22, 1999? 

— How likely is it that taxi 1234 had visual contact with (was within 100 meters 
of) taxi 4321 between 9.00 and 13.00 on April 22, 1999? 

— Which taxis were, with a 50% probability, in Central Park at 10.00 on April
22, 1999?



2.2 Global Positioning System 

The Global Positioning System is able to determine exact positions on Earth 
anytime, in any weather, and anywhere. The system consists of 24 satellites that 
orbit Earth at 20000 km. The satellites transmit signals that can be detected 
by GPS receivers, which then are able to determine their locations with great 
precision. 

The principle behind the GPS is the measurement of the distances between
a receiver and several satellites. A total of four distances, and thus signals from
four satellites, are needed to solve a set of four equations that expresses the
latitude, longitude, height, and time (Magellan Corporation). The distance
from a satellite to the receiver can be calculated by multiplying the time it
takes for the signal to arrive by the speed at which it travels, the speed of light.
Although four visible satellites are enough to compute a position, the more
satellites that are visible, the more precise the computed position becomes.
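
The distance computation just described can be sketched in a few lines of Python. This is only an illustration of the time-times-speed principle; the function name is ours, and real receivers additionally have to solve for their own clock offset, which is why four satellites are needed.

```python
# Speed of light in vacuum, in meters per second (exact by SI definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def satellite_distance(travel_time_s: float) -> float:
    """Distance in meters covered by a signal that travelled for travel_time_s seconds."""
    return travel_time_s * SPEED_OF_LIGHT_M_PER_S


# Travel time of a signal from a satellite orbiting at roughly 20000 km.
travel_time = 20_000_000.0 / SPEED_OF_LIGHT_M_PER_S
print(f"{travel_time * 1e3:.1f} ms -> {satellite_distance(travel_time) / 1e3:.0f} km")
```

The sketch also shows why precise clocks matter: a timing error of one microsecond already shifts the computed distance by about 300 meters.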

More information about the GPS can be found in, e.g., Magellan Corporation
and Leick.

3 Sampling and Uncertainty 

This section covers how to acquire and represent the movement of point objects. 
We first describe the technical means for determining the time-varying positions
of moving point objects, and subsequently give a suitable way to represent the 
entire movement. An important part of the representation is the uncertainty 
caused by the acquisition process. The section describes the uncertainty caused 
by the measurement error and the sampling error, and it concludes with a dis- 
cussion of the relative importance of these errors. 



3.1 Acquiring Movement — Measuring Position in Time

In order to record the movement of an object, we would have to know the position 
at all times, i.e., on a continuous basis. However, GPS and telecommunications
technologies only allow us to sample an object’s position, i.e., to obtain the
position at discrete instants of time, such as every few seconds.

The solid line in Fig. 1(a) represents the movement of a point object. Space
(x and y axes) and time (t axis) are combined to form one coordinate system. 
The dashed line shows the projection of the movement in two-dimensional space 
(x and y coordinates). 

A first approach to represent the movements of objects would be to store the 
position samples. For our database, this would mean we could not answer queries 
about the objects’ movements at times in-between sampled positions. Rather, 
to obtain the entire movement we have to interpolate. The simplest approach 
is to use linear interpolation, as opposed to other methods such as polynomial 
splines (Bartels et al.). The sampled positions then become the end points of
line segments of polylines, and the movement of an object is represented by an
entire polyline in three-dimensional space. In geometrical terms, the movement
of an object is termed a trajectory (we will use “movement” and “trajectory”
interchangeably).
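
As an illustration of the linear interpolation described above, the following Python sketch computes the position of an object at an arbitrary time from its time-ordered samples. The representation of a sample as a `(t, x, y)` tuple and the function name `position_at` are assumptions made for this example only.

```python
from bisect import bisect_right
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (t, x, y), hypothetical sample layout


def position_at(samples: List[Sample], t: float) -> Tuple[float, float]:
    """Linearly interpolate the position at time t.

    samples must be sorted by strictly increasing time; each pair of
    consecutive samples is one line segment of the polyline.
    """
    if not samples:
        raise ValueError("no samples")
    times = [s[0] for s in samples]
    if t < times[0] or t > times[-1]:
        raise ValueError("t lies outside the sampled time span")
    i = bisect_right(times, t)
    if i == len(samples):  # t equals the time of the last sample
        return samples[-1][1], samples[-1][2]
    t1, x1, y1 = samples[i - 1]
    t2, x2, y2 = samples[i]
    f = (t - t1) / (t2 - t1)  # fraction of the segment covered at time t
    return x1 + f * (x2 - x1), y1 + f * (y2 - y1)


trajectory = [(0.0, 0.0, 0.0), (10.0, 300.0, 0.0), (20.0, 300.0, 100.0)]
print(position_at(trajectory, 5.0))
```

The binary search over the sample times keeps the lookup logarithmic in the number of samples, which matters once a trajectory consists of many segments.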

Fig. 1(b) shows the spatiotemporal space (the cube in solid lines) and several
trajectories (the solid lines). Time moves in the upward direction, and the top of 
the cube is the time of the most recent position sample. The wavy-dotted lines 
at the top symbolize the growth of the cube with time. 

3.2 Quantifying Uncertainty 

The research on uncertainty in geospatial information is concerned with all
sources of incorrectness and incompleteness in the measurement, analysis, and
interpretation of digitally-represented, Earth-referenced phenomena (Unwin).

Fig. 1. Movements and spaces: (a) trajectory of a moving point; (b) a spatiotemporal space



A representation of moving-point trajectories is inherently imprecise: impre- 
cision is introduced by the measurement process used in the sampling of positions 
and by the sampling approach itself. A useful representation of moving points 
must take these uncertainties into account. 

In this paper, we make the following assumptions. 

— We will not consider any error connected to the times of measurements. We 
assume that we know precisely the time a position sample was taken. This 
assumption seems to be justified when using the GPS and its precise clocks 
as a measuring device. 

— Within one application, we will only consider objects with similar movement 
characteristics, such as speed and range. Typical examples of objects with 
different characteristics include people, cars, and planes. 

A first step in incorporating uncertainty into a representation of trajectories is to 
quantify it. We thus proceed to describe the errors introduced by the trajectory 
acquisition process. 

3.3 Measurement Error 

Generally, an error can be introduced by inaccurate measurements (Leick).
The accuracy and thus the quality of the measurement depends largely on the 
technique used. This paper assumes that the GPS is used for the sampling of 
positions. 

Two assumptions are generally made when talking about the accuracy of the 
GPS. First, the error distribution, i.e., the error in each of the three dimensions 
and the error in time, is assumed to be Gaussian. Second, we can assume that 
the horizontal error distribution, i.e., the distribution in the x-y plane, is circular 
(van Diggelen).

The error in a positional GPS measurement can be described by the probability
function in Equation (1). The probability function is composed of two
normal distributions in the two respective spatial dimensions. The mean of the
distribution is the origin of the coordinate system. Fig. 2 visualizes the error
distribution. In addition to the mean, the standard deviation, $\sigma$, is a
characteristic parameter of a normal distribution. In a bivariate (2-dimensional)
normal distribution, 39.35% of the probability is concentrated within
$\pm\sigma$ of the mean.

\[
\mathcal{P}_1(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}}
\tag{1}
\]




Fig. 2. Positional error in the GPS 
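
The 39.35% figure can be checked numerically. The following Python sketch assumes a circular bivariate normal error with mean at the origin and equal standard deviation in both dimensions; the function names are ours. For such a distribution, the probability of an error of at most $k\sigma$ has the closed form $1 - e^{-k^2/2}$.

```python
import math


def error_density(x: float, y: float, sigma: float) -> float:
    """Density of a circular bivariate normal error centered at the origin."""
    return math.exp(-(x * x + y * y) / (2.0 * sigma * sigma)) / (
        2.0 * math.pi * sigma * sigma
    )


def mass_within(k: float) -> float:
    """Probability that the positional error is at most k standard deviations."""
    return 1.0 - math.exp(-0.5 * k * k)


for k in (1.0, 2.0, 3.0):
    print(f"within {k:.0f} sigma: {100.0 * mass_within(k):.2f}%")
```

Note that the one-sigma mass is considerably lower than the 68.27% familiar from the one-dimensional normal distribution, which is why a "2 m (1 sigma)" accuracy statement is weaker than it may first appear.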



Example 1. A typical GPS module used in vehicle navigation systems is the
CrossCheck AMPS Cellular from Trimble Navigation Ltd. This GPS/cellular
phone system has an error of 2 m (equal to 1$\sigma$) (Trimble Navigation Ltd.).
This measure refers to the standard deviation of a bivariate normal distribution
centered at the receiver’s true antenna position.

3.4 Uncertainty in Sampling 

We capture the movement of an object by sampling its position using a GPS 
receiver at regular time intervals. This introduces uncertainty about the position 
of the object in-between the measurements. In this section, we give a model for 
the uncertainty introduced by the sampling, based on the sampling rate and the 
maximum speed of the object. 

Sampling Error The uncertainty of the representation of an object’s movement 
is affected by the frequency with which position samples are taken, the sampling 
rate. This, in turn, may be set by considering the speed of the object and the 
desired maximum distance between consecutive samples. Let us consider the 
running example, in which we want to record the movements of taxis. 

Example 2. As an application requirement, the distance between two consecutive
samples should be at most 10 meters. If the maximum speed of a taxi
is 150 km/h (about 41.7 m/s), we need to sample the position at least 4.2
times per second. If a taxi moves slower than its maximum speed, the distance
between samples is less than 10 meters.
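
The arithmetic of Example 2 can be reproduced with the following Python sketch. The function name and parameter names are hypothetical; the only inputs are the maximum speed and the maximum allowed distance between consecutive samples.

```python
def min_sampling_rate_hz(max_speed_kmh: float, max_gap_m: float) -> float:
    """Samples per second needed so that consecutive samples are at most max_gap_m apart."""
    max_speed_ms = max_speed_kmh * 1000.0 / 3600.0  # km/h -> m/s
    return max_speed_ms / max_gap_m


rate = min_sampling_rate_hz(150.0, 10.0)
print(f"{rate:.1f} samples per second, i.e., one sample every {1.0 / rate:.2f} s")
```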

We proceed to consider how the position samples resemble the true movement of
the object. Consider the three trajectories shown in Fig. 3. Each is possible given
the three measured positions $P_1$ through $P_3$. However, by just “looking” at the
three positions, one would assume that the straight line best resembles the actual
trajectory of the object. Since we did not measure the positions in-between two
consecutive position samples, the best we can do is to limit the possibilities of
where the moving object could have been.

Fig. 3. Possible trajectories of a moving object

We have to constrain the trajectory of
the object by what we know about the object’s actual movement. Considering the
trajectory in a time interval $[t_1, t_2]$, delimited by consecutive samples, we know
two positions, $P_1$ and $P_2$, as well as the object’s maximum speed, $v_m$; see Fig. 4.
If the object moves at maximum speed $v_m$ from $P_1$ and its trajectory is a straight
line, its position at time $t_x$ will be on a circle of radius $r_1 = v_m(t_x - t_1)$ around
$P_1$ (the smaller dotted circle in Fig. 4). Thus, the points on the circle represent
the farthest the object can have gotten from $P_1$ by time $t_x$. If the object’s speed
is lower than $v_m$, or its trajectory is not a straight line, the object’s position at
time $t_x$ will be somewhere within the area bounded by the circle of radius $r_1$.
Next, we know that the object will be at position $P_2$ at time $t_2$. Thus, applying
the same assumptions again, the object’s position at time $t_x$ is on the circle with
radius $r_2 = v_m(t_2 - t_x)$ around $P_2$. If the object moves slower or its trajectory
is not a straight line, it is somewhere within the area bounded by this circle.

Fig. 4. Uncertainty between samples

The above constraints on the position of the object mean that the object
can be anywhere in the intersection of the two circular areas at time $t_x$. This
intersection is shown by the shaded area in Fig. 4, and we use the term lens
for this intersection. Since we do not have any further information, we assume a
uniform distribution for the position within the lens.

Dieter Pfoser and Christian S. Jensen

Thus, the sampling error at time $t_x$ for a particular position can be described
by the probability function shown in Equation (2), where $r_1$ and $r_2$ are the two
radii described above, $s$ is the distance between the measured positions $P_1$ and
$P_2$, and $A$ denotes the area of the intersection of the two circles.

\[
\mathcal{P}_2(x, y) =
\begin{cases}
\frac{1}{A} & \text{for } x^2 + y^2 \le r_1^2 \;\wedge\; (x - s)^2 + y^2 \le r_2^2 \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]



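Substituting $v_m(t_x - t_1)$ and $v_m(t_2 - t_x)$ for the radii $r_1$ and $r_2$,
respectively, the probability function shown in Equation (3) results. Its
parameters are described in Table 1.

\[
\mathcal{P}_2(x, y) =
\begin{cases}
\frac{1}{A} & \text{for } x^2 + y^2 \le \left(v_m (t_x - t_1)\right)^2 \;\wedge\; (x - s)^2 + y^2 \le \left(v_m (t_2 - t_x)\right)^2 \\
0 & \text{otherwise}
\end{cases}
\tag{3}
\]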

For a visualization of a sampling error, refer to Fig. 5(a), in which the two
horizontal axes depict x and y coordinates, and the vertical axis the positional 
probability. 
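
To make the uniform lens distribution concrete, the following Python sketch evaluates it for a given point. It places $P_1$ at the origin and $P_2$ at $(s, 0)$, takes the radius around $P_1$ as the distance reachable since $t_1$, i.e., $v_m(t_x - t_1)$, and uses the standard formula for the area of intersection of two circles. All function and parameter names are assumptions made for this example.

```python
import math


def lens_area(r1: float, r2: float, s: float) -> float:
    """Area of the intersection of two disks with radii r1, r2 whose centers are s apart."""
    if s >= r1 + r2:  # disks are disjoint or touch in a single point
        return 0.0
    if s <= abs(r1 - r2):  # one disk lies entirely within the other
        return math.pi * min(r1, r2) ** 2
    a1 = r1 * r1 * math.acos((s * s + r1 * r1 - r2 * r2) / (2.0 * s * r1))
    a2 = r2 * r2 * math.acos((s * s + r2 * r2 - r1 * r1) / (2.0 * s * r2))
    kite = 0.5 * math.sqrt(
        (-s + r1 + r2) * (s + r1 - r2) * (s - r1 + r2) * (s + r1 + r2)
    )
    return a1 + a2 - kite


def sampling_error_density(
    x: float, y: float, v_m: float, t1: float, t2: float, tx: float, s: float
) -> float:
    """Uniform density over the lens at time tx; zero outside the lens."""
    r1 = v_m * (tx - t1)  # reachable distance from P1 since t1
    r2 = v_m * (t2 - tx)  # remaining distance budget to reach P2 by t2
    area = lens_area(r1, r2, s)
    if area == 0.0:  # degenerate lens (empty or a single point)
        return 0.0
    inside = x * x + y * y <= r1 * r1 and (x - s) ** 2 + y * y <= r2 * r2
    return 1.0 / area if inside else 0.0


# Halfway between two samples taken 10 s and 300 m apart, maximum speed 42 m/s.
print(sampling_error_density(150.0, 0.0, 42.0, 0.0, 10.0, 5.0, 300.0))
```

Because the density is constant inside the lens, only the lens area has to be computed per time instant; membership of a point is a pair of distance comparisons.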



Table 1. Parameters of the probability function, $\mathcal{P}_2$, describing the
sampling error

Parameter   Description
$v_m$       maximum speed of the moving object
$t_x$       time for which the error distribution is computed
$t_1$       time of the first measured position
$t_2$       time of the second measured position
$s$         distance between the two positions, i.e., the length of the line segment
$A$         lens area, i.e., the area of the intersection of the two circles




Fig. 5. Probability functions for sampling errors: (a) normal-case sampling error; (b) worst-case sampling error



Sampling Error Across Time So far, we have quantified the sampling error
for the position at a single point in time. To determine the error across time, as a
first step, we compute the lens for various $t_x \in [t_1, t_2]$, as shown in Figs. 6(a)-(c).
The circle around the first point, $P_1$, measured at time $t_1$, is initially a point
and grows as time advances, and the circle around the second point, $P_2$, shrinks
with the advancement of time and eventually becomes a point. In the first
situation in Fig. 6(a), the circle around $P_2$ contains the one around $P_1$, meaning that
the constraint on how far away the object can be from $P_1$ at $t_x$ is more restrictive
than the constraint on how close it has to be to $P_2$. The area of intersection is
the entire circle of radius $r_1$. In the second situation, Fig. 6(b), the two circles
start intersecting, and in Fig. 6(c) they show a clear intersection.

Fig. 6. Evolving sampling error

We observe that the intersection points of the two circles over time, i.e., for
the cases where the circles actually intersect, lie on an error ellipse with positions
$P_1$ and $P_2$ as its foci (cf. Fig. 7). The length of the major axis is $2a = r_1 + r_2$.
This is not surprising if we consider the definition of an ellipse. An ellipse is
a curve consisting of all points in the plane whose distances, $r_1$ and $r_2$,
from two fixed points, $P_1$ and $P_2$ (the foci), separated by a distance of $2c$, sum
to a given constant, $2a$. The measure $2c$ can be interpreted as the observed distance
between $P_1$ and $P_2$, whereas $2a$ is the maximum distance the object can travel.
The “thickness” of the ellipse, $2b$, is determined by the equation $b^2 = a^2 - c^2$.
This means that the smaller the difference between the observed distance, $2c$,
and the maximum distance, $2a$, the “thinner” the ellipse. In the extreme case,
the ellipse degrades to a line segment. In the worst case, where the object does
not move between consecutive position samples, the ellipse becomes a circle.

Fig. 7. Error ellipse
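
The constancy of the sum of the two radii can be verified directly. The short derivation below assumes that $r_1$ is the distance reachable from $P_1$ in the time elapsed since $t_1$, i.e., $r_1 = v_m(t_x - t_1)$, and that $r_2 = v_m(t_2 - t_x)$; the time $t_x$ then cancels, so every intersection point of the two circles satisfies the defining property of the ellipse.

```latex
\begin{align*}
r_1 + r_2 &= v_m (t_x - t_1) + v_m (t_2 - t_x) \\
          &= v_m (t_2 - t_1) = 2a, \\
b^2 = a^2 - c^2 \quad &\Longrightarrow \quad 2b = \sqrt{(2a)^2 - (2c)^2}.
\end{align*}
```

The second line expresses the minor axis directly in terms of the maximum travel distance $2a$ and the observed distance $2c$ between the two samples.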



Sampling Rate Having derived the general principle behind the sampling error, 
we give an example of how an increased sampling rate affects the error size. To 
illustrate the underlying principle, we use the error ellipse given in Fig. 7 as a
measure for the size of the sampling error per line segment. 

Example 3. In Fig. 8 we show the actual trajectory of a moving object as a
bold line. As a first step, we sample the movement of the object at positions $P_1$
and $P_2$. The time in-between the samples is 10 seconds. The shortest distance
from $P_1$ to $P_2$ is 300 meters. Thus, to the best of our knowledge, the object
travels at a speed $v$ of 30 m/s. If we further know the maximum speed of the
object to be 42 m/s, we can draw an error ellipse around the line approximating
the movement. The error ellipse has an eccentricity $2c = 300$ m, a major axis
$2a = v_{max} \cdot \Delta t = 42$ m/s $\times$ 10 s $= 420$ m, and a minor axis
$2b = \sqrt{(2a)^2 - (2c)^2} = \sqrt{420^2 - 300^2} \approx 294$ m. This rather large
error ellipse means that the position of the object in-between samples is quite
uncertain. Quadrupling the sampling rate, i.e., sampling the position every 2.5
seconds, leads to an error ellipse that has an eccentricity $2c = 80$ m, a major
axis $2a = v_{max} \cdot \Delta t = 42$ m/s $\times$ 2.5 s $= 105$ m, and a minor axis
$2b = \sqrt{(2a)^2 - (2c)^2} = \sqrt{105^2 - 80^2} \approx 68$ m.
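
The numbers in Example 3 can be reproduced with the following Python sketch. The function name and its parameters are hypothetical; the computation itself only uses the relations $2a = v_{max} \cdot \Delta t$ and $(2b)^2 = (2a)^2 - (2c)^2$ stated in the text.

```python
import math
from typing import Tuple


def error_ellipse_axes(v_max: float, dt: float, sample_dist: float) -> Tuple[float, float]:
    """Return (major axis 2a, minor axis 2b) of the error ellipse of one segment.

    v_max is the maximum speed, dt the time between the two samples, and
    sample_dist the observed distance 2c between the two sampled positions.
    """
    major = v_max * dt
    if sample_dist > major:
        raise ValueError("samples are farther apart than the maximum speed allows")
    minor = math.sqrt(major * major - sample_dist * sample_dist)
    return major, minor


for dt, dist in ((10.0, 300.0), (2.5, 80.0)):
    major, minor = error_ellipse_axes(42.0, dt, dist)
    print(f"dt = {dt} s: 2a = {major:.0f} m, 2b = {minor:.0f} m")
```

Quadrupling the sampling rate shrinks both axes by more than a factor of four here, because the sampled distance of 80 m is closer to the maximum travel distance of 105 m than 300 m is to 420 m.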




Fig. 8. Varying sampling rate 



If we increase the sampling rate, the sample positions better approximate the
movement, and the error introduced by sampling decreases.

Maximum Speed Challenged An underlying assumption so far has been that the
maximum speed of a moving object is fixed at $v_{max}$. However, the more we know
about the object in question, the further we can narrow down $v_{max}$ and thus
reduce the uncertainty. For example, if we know that a taxi can reach 200 km/h,
but regulations of the company set 120 km/h as the upper limit, we may decide to
assume that $v_{max}$ is 150 km/h. Further examples of such additional information
are local speed limits and road conditions; thus the maximum speed can vary
depending on the area the taxi is in. Traffic volumes, which are time dependent,
may also be taken into account. Further, there might be individual speed limits
for drivers and cars. Generally, the more information we have about an object,
the better we can adjust the sampling rate, reduce the error, and, consequently,
minimize the uncertainty attached to its polyline trajectory.

Worst-Case Sampling Error Previously we identified the size and extent
of the sampling error for a particular line segment and time. However, for use
in Section 5, we also need an error measure in the situation where an object
does not move between consecutive samples. In this case, the sampling error is
determined by a circle of radius, $r$, equal to half the sampling interval multiplied
by the maximum speed.

Example 4. Consider again the taxi from Example 3, whose position is sampled
every 2.5 seconds. If the taxi is stopped, the eccentricity is $2c = 0$ (the foci
coincide) and the error ellipse degrades to a circle. The major axis,
$2a = v_{max} \cdot \Delta t = 42$ m/s $\cdot$ 2.5 s $= 105$ m, is equal to the minor axis.
The radius of the circle then is 52.5 meters.

If we have no further information about the position of the object, all positions 
within the circle are equally likely, yielding a circular uniform, worst-case 
error distribution, for which the probability function is given below, where 
r is the radius. 
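
For a uniform distribution over a disc of radius r, the density has the standard form sketched here. The symbol p and the choice of the recorded position as the origin of the (x, y) coordinates are notational assumptions, not taken from the text.

```latex
p(x, y) =
\begin{cases}
  \dfrac{1}{\pi r^{2}} & \text{if } x^{2} + y^{2} \le r^{2}, \\[4pt]
  0 & \text{otherwise.}
\end{cases}
```

The constant 1/(πr²) is the reciprocal of the disc area, so the density integrates to 1 over the plane.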



For a visualization of the worst-case sampling error, refer to panel (b) of the 
corresponding figure, in which the two horizontal axes depict x and y coordinates 
and the vertical axis the positional probability. 

3.5 Comparison of Error Sources 

With current GPS technology, a moving object’s position can be determined 
instantaneously with an accuracy of 2m (cf. the earlier example), and this error 
will be reduced further with the advancement of GPS technology. How frequently 
position samples are taken depends on the particular application. In fleet 
management, determining the position every 2.5 seconds leads to a worst-case error 
of roughly 50m. This is the radius of a circular distribution, assuming that the 
maximum speed of the objects is 150km/h (cf. Example 4). In practice the 
sampling rates will be much lower, thus allowing for worst-case errors of 200m or 
more. 

It follows that the measurement error is small compared to the sampling error 
in fleet management. Therefore, we will consider only the uncertainty that stems 
from the sampling, and disregard the uncertainty caused by the measurement 
technique, in the remainder of the paper. 

4 A Representation for Moving Point Objects 

Section 3 proposed a technique for capturing the movement of point objects 
that utilizes polylines, and it characterized the error introduced by this 
technique, thereby also revealing the uncertainty inherent in the polyline 
representation. This section’s objective is to provide a format for representing 
the history of the positions of continuously moving point objects, along with the 
uncertainty associated with our records of their positions. For this, we propose 
a relational database schema that incorporates all the spatiotemporal and error 
information previously presented in this paper. 

Specifically, the schema in Table 2 defines relations for objects, for the line 
segments constituting the trajectories of the objects, and for the error 
information associated with the recorded trajectories. Relation Object has attributes 



Table 2. Relational schema for capturing moving-point objects, their 
trajectories, and associated error information 

Object < object_id, max_speed, etc. > 

Line_segment < line_id, object_id, t1, t2, x1, x2, y1, y2, error_id > 

Error < error_id, error_type, param1, param2 > 



object_id, which is the key attribute, and max_speed, which determines the 
maximum speed at which the object can move. In addition, this relation may 
include any number of attributes unrelated to the objects’ spatial extents. 
Relation Line_segment captures the line segments that compose the trajectories 
of the objects. Attribute line_id is the key attribute; object_id is a foreign key 
referencing relation Object; and t1 and t2 are the times when the two positions, 
(x1, y1) and (x2, y2), constituting the line segment, were measured. Finally, 
relation Error contains the error information associated with the line segments. 
Attribute error_id is the key; error_type specifies the type of error that a tuple 
refers to, and thus specifies how parameters param1 and param2 are to be 
interpreted. In the current schema, there is only one type of error. However, if we 
consider more error sources in our application, additional types of errors may 
occur. 

The domains of the attributes are as follows. Define dom(x) to be a function 
that returns the domain of its argument attribute x. Then dom(object_id) = 
dom(line_id) = dom(error_id) = dom(max_speed) = dom(t1) = dom(t2) = N, 
where N is the natural numbers, dom(param1) = dom(param2) = N ∪ {NIL}, 
dom(x1) = dom(x2) = dom(y1) = dom(y2) = Z, where Z is the integers, and 
dom(error_type) = {worst-case_sampling}. 
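
The schema and domains above can be tried out in a relational engine. The following is a minimal sketch using Python's built-in sqlite3 module; the choice of SQLite, the use of NULL for NIL, the CHECK constraint, and the sample rows (in particular the segment's times and coordinates) are assumptions made for illustration and are not prescribed by the paper.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Object (
    object_id INTEGER PRIMARY KEY,
    max_speed INTEGER NOT NULL
);
CREATE TABLE Error (
    error_id   INTEGER PRIMARY KEY,
    error_type TEXT NOT NULL CHECK (error_type IN ('worst-case_sampling')),
    param1     INTEGER,   -- NULL plays the role of NIL
    param2     INTEGER
);
CREATE TABLE Line_segment (
    line_id   INTEGER PRIMARY KEY,
    object_id INTEGER NOT NULL REFERENCES Object(object_id),
    t1 INTEGER NOT NULL, t2 INTEGER NOT NULL,
    x1 INTEGER NOT NULL, x2 INTEGER NOT NULL,
    y1 INTEGER NOT NULL, y2 INTEGER NOT NULL,
    error_id  INTEGER REFERENCES Error(error_id)
);
"""


def create_database() -> sqlite3.Connection:
    """Create an in-memory database holding the three relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the foreign keys
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
conn.execute("INSERT INTO Object VALUES (1234, 120)")
conn.execute("INSERT INTO Error VALUES (1, 'worst-case_sampling', 25, NULL)")
# Hypothetical segment: times and coordinates are invented for illustration.
conn.execute("INSERT INTO Line_segment VALUES (1, 1234, 0, 10, 100, 160, 200, 280, 1)")

# Join a segment with its object's maximum speed and its error parameter.
rows = conn.execute(
    """SELECT l.line_id, o.max_speed, e.param1
       FROM Line_segment l
       JOIN Object o ON o.object_id = l.object_id
       JOIN Error  e ON e.error_id  = l.error_id"""
).fetchall()
print(rows)
```

Keeping the error parameters in their own relation means that many segments can share one error tuple, which matches the single worst-case entry used in the paper's example.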

The following example illustrates how the above schema can be put to use. 

Example 5. Our taxi company operates a number of taxis in a city. The database 
in Fig. 9 captures the movement of the taxis together with the associated errors. 

This database permits the company to reconstruct the trajectories of its 
taxis and to compute the associated error information. All taxis are recorded 
in relation Object, and their trajectories are kept in Line_segment and are 
referenced through the foreign key object_id. 



(a) Object 

object_id   max_speed 
1234        120 
4321        150 
1235        140 

(b) Line_segment 

line_id   object_id   error_id   t1 | t2 | x1 | x2 | y1 | y2 
1         1234        1 
2         1234        1 
3         4321        1 

(c) Error 

error_id   error_type            param1   param2 
1          worst-case_sampling   25       NIL 

Fig. 9. An example database containing positional and error information con- 
nected to a fleet management application of a taxi company 



To utilize the error information, the parameters for the various probability 
functions are recorded. The parameter of the worst-case sampling error, V3, is 
the radius r, which in our database is stored as the param1 attribute value (25) 
of the only tuple in relation Error. 

The parameters of the sampling error, V2, as shown in the earlier table, are the 
distance s, which is computed from the attribute values for x1, x2, y1, and y2 
in relation Line_segment, together with the times t1 and t2. The maximum 
speed Vmax is stored in relation Object. Finally, the time instant of interest is 
not a static parameter that can be stored in a relation, but is an input parameter 
from a query. Thus, the intersection area A, which is different for each position 
in time contained in a particular line segment, can also only be computed once 
this time instant is known. 
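
As an illustration of which quantities are stored and which depend on the query, the sketch below derives them for one segment and one query time. It assumes that the sampling error at time t is bounded by the two reachability circles around the measured end positions (radii Vmax·(t − t1) and Vmax·(t2 − t)) and that the positional mean is the linearly interpolated point. The function name, argument order, and unit conventions are hypothetical.

```python
import math


def sampling_error_parameters(x1, y1, t1, x2, y2, t2, v_max, t):
    """Return (s, mean, r1, r2) for a query time t with t1 <= t <= t2.

    s    : distance between the two measured positions
    mean : linearly interpolated position at time t
    r1   : how far the object can be from (x1, y1) at time t
    r2   : how far the object can be from (x2, y2) at time t
    v_max is given in distance units per time unit.
    """
    if not (t1 < t2 and t1 <= t <= t2):
        raise ValueError("require t1 < t2 and t1 <= t <= t2")
    s = math.hypot(x2 - x1, y2 - y1)
    frac = (t - t1) / (t2 - t1)
    mean = (x1 + frac * (x2 - x1), y1 + frac * (y2 - y1))
    r1 = v_max * (t - t1)
    r2 = v_max * (t2 - t)
    return s, mean, r1, r2


# Hypothetical segment from (0, 0) at time 0 to (30, 40) at time 10.
print(sampling_error_parameters(0, 0, 0, 30, 40, 10, 10.0, 5))
```

Only s, t1, t2, and Vmax can be read from the stored tuples; the two radii, and hence the intersection area, change with every query time.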

5 Query Processing and Indexing 

The objective of this section is to explore the use of error information when using 
indices for processing queries involving the positions of moving objects. The 
section first sets the context within spatiotemporal indexing for its contribution. 
Subsequently, it shows how a moving-point index may be put to use in the 
processing of spatiotemporal range queries involving positional uncertainty. It 
then discusses which types of queries can be answered in the given framework. 
The section ends with a summary of the proposed approach. 

5.1 Context 

The purpose of spatiotemporal indexing is to efficiently support the retrieval 
of those objects, from a large set of objects, with spatiotemporal extents that 
satisfy a specified query predicate. The most commonly considered predicate is 
intersection with a specified region. 

Substantial research is currently ongoing in spatiotemporal indexing, and a 
number of spatiotemporal indices have already been proposed; see Theodoridis 
et al. for an overview. Although an index well suited for indexing the 
trajectories of the kinds of moving-point objects considered here still does not exist, 
it is expected that such an index will be invented. 

In terms of the representation proposed in this paper, this means that we 
can expect to be able to index the polyline segments that represent trajecto- 
ries. However, taking the uncertainty of the trajectories into consideration cor- 
responds to the indexing of (non-point) objects with spatial extents, and the 
envisioned moving-point indices are no longer readily applicable. 

Based on the assumption that it will be substantially more attractive to index 
the trajectories of moving-point objects than to index the trajectories of objects 
with spatial extents, which are more complex, this section offers an approach to 
using moving-point trajectory indices while taking into account the uncertainty 
of the trajectories and also taking into account query predicates relating to the 
uncertainty. 

The approach employs the fundamental technique from spatial indexing of 
using approximations for the spatial extents to be indexed (Güting). For 
instance, R-trees generally use minimum bounding boxes. This use leads to a 
filter-and-refine strategy for query processing. First, based on the approximations, a 
filtering step is executed that returns a superset of the objects fulfilling the query 
predicate. Second, in the refinement step, the exact extents of the objects 
resulting from the first step are checked against the query predicate (Brinkhoff et 
al.). 

5.2 Processing Uncertainty Queries 

The goal here is to be able to use a moving-point index to answer queries such as 
“Retrieve the positions of taxis that were inside area A (specified as a rectangle) 
between times B and C with a probability of at least 30%.” 

The first step is to specify the meaning of an object’s position being within 
an area A with a probability of 30%. An object’s position is described by means 
of a probability function centered around the positional mean (e.g., recall the 
probability function of the worst-case sampling error shown earlier). 

If all of an object’s positional probability is within an area A, we say that 
the object is within area A for certain. We can determine if this is the case by 
integrating over the probability function with area A as the limit. If the result 
is 1, this is the case. 
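
For the uniform worst-case distribution, integrating over the probability function with a rectangle as the limit amounts to measuring how much of the error disc overlaps the rectangle. The sketch below approximates this with a midpoint rule over vertical strips; the function name, the argument layout, and the numerical method are assumptions chosen for brevity, not the paper's procedure.

```python
import math


def prob_in_rectangle(cx, cy, r, x_min, x_max, y_min, y_max, steps=20000):
    """Share of a uniform disc (center (cx, cy), radius r) inside the rectangle."""
    lo, hi = max(x_min, cx - r), min(x_max, cx + r)
    if lo >= hi:
        return 0.0  # no horizontal overlap at all
    dx = (hi - lo) / steps
    area = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        # Half the length of the disc's vertical chord at this x.
        half_chord = math.sqrt(max(r * r - (x - cx) ** 2, 0.0))
        top = min(y_max, cy + half_chord)
        bottom = max(y_min, cy - half_chord)
        if top > bottom:
            area += (top - bottom) * dx
    return area / (math.pi * r * r)


# A disc of radius 25 whose center sits exactly on the left edge of the window.
print(round(prob_in_rectangle(0, 0, 25, 0, 200, -100, 100), 3))
```

A result of 1 (up to numerical error) corresponds to the "within area A for certain" case described above.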

If the object is within area A with a probability of at least 30%, at least 30% 
of the positional probability has to be concentrated within area A. This case is 
shown in Fig. 10, where the circle represents the probability function of the 
worst-case sampling error, the rectangular shape is the query rectangle, and the 
shaded region represents the probability in the query window. The result of 
integrating over the probability function with rectangle 
A = ([x_min, x_max], [y_min, y_max]) as the limit thus has to be 0.3 or higher. 
Further, if the positional error is rotationally symmetric around the positional 
mean, as is the case for the worst-case sampling error, we can determine the 
maximum distance of the query window to the positional mean such that the 
probability of the position being within the query window is 30% or higher. We 
term this computed distance the expansion measure. In Fig. 10 this distance is 
indicated by the arrow from the edge of the query window to the center of the 
probability function (the mean). The expansion measure can be interpreted as the 
measure by which the query window has to be expanded to contain the positional 
mean (the dotted query window in the figure). 

Fig. 10. Summing up the probability 

Note that we are assuming that the query window is longer and wider than 
2r, the diameter of the error distribution; later in this section, we will revisit this 
assumption. However, for now, we proceed to show how the expansion measure 
can be used in the filter step. 
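
One way to compute an expansion measure for the uniform worst-case error is sketched below. It assumes that, for a window larger than 2r, only the nearest window edge matters and treats that edge as a straight line; this can only overestimate the probability inside the window, so it is safe for filtering. The closed form used is the standard area of a circular segment; the bisection, the function names, and the sign convention (positive = center outside the window) are assumptions.

```python
import math


def inside_fraction(d, r):
    """Share of a uniform disc of radius r on the inner side of a straight edge.

    d is the signed distance of the disc center from the edge
    (positive: center outside the window, negative: center inside).
    """
    if d >= r:
        return 0.0
    if d <= -r:
        return 1.0
    u = d / r
    return (math.acos(u) - u * math.sqrt(1.0 - u * u)) / math.pi


def expansion_measure(p, r, tol=1e-9):
    """Largest signed distance d such that inside_fraction(d, r) >= p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    lo, hi = -r, r  # inside_fraction(lo) = 1 >= p and inside_fraction(hi) = 0 < p
    while hi - lo > tol * r:
        mid = (lo + hi) / 2.0
        if inside_fraction(mid, r) >= p:
            lo = mid
        else:
            hi = mid
    return lo


for p in (0.3, 0.5, 0.6):
    print(p, round(expansion_measure(p, 25.0), 2))
```

Under these assumptions the measure is positive below 50%, zero at exactly 50%, and negative above 50%.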

The Filter Step We record the positions in time of the moving objects by 
means of line segments. The points on these segments are the mean values of 
the positional probability functions. Our objective for the filter step is to retrieve 
those line segments that contain positions in time qualifying for the query result, 
i.e., those positions that, with the probability specified in the query, are in the 
query window. To retrieve these line segments, we intersect the expanded query 
window with the indexed line segments. Expanding the query window means 
that all positions with a probability higher than or equal to the one specified in 
the query (the one used to compute the expansion measure) are contained in the 
query window. 

For the filter step, the error measure used to determine the expansion measure 
can be coarse, but has to be universal so that it applies to all positions in the 
database. This is true for the worst-case sampling error described in Section 3. 

As we shall see next, this method can only be applied if the probability 
specified in the query is less than 50%. 

Consider again the above query, but with a probability of 60%. Using the 
worst-case sampling error leads to a negative query expansion measure (cf. 
Fig. 11). If we used this smaller query window, we would retrieve a subset 
rather than a superset of the qualifying objects, since, e.g., positions that have 
no error (or a small error) and lie on (or just inside) the borders of the query 
window would be disregarded. An example here is position P in Fig. 11, which 
has no associated error. Shrinking the query window by the size of the negative 
expansion measure would eliminate this position from our set of candidate 
solutions. 

This problem is solved by simply using the original query window with no 
expansion (shrinking) for probabilities higher than 50%. This means that we 
retrieve a superset of the qualifying objects. 




Fig. 11. Query window expansion: high probability 



The Refinement Step To determine the final result, we have to evaluate the 
query predicate on all objects identified during the filtering step. In our case we 
have identified line segments that intersect with the transformed query region. 
As the final answer, we would like to have a set of positions, and so the refinement 
step extracts those parts of the line segments that qualify for this set. 

In the filter step the intersection of line segments with the query window was 
determined with the help of the worst-case sampling error. To evaluate positions 
in time in the refinement step we will use the sampling error, which is unique for 
every position. A very straightforward way to achieve this is to apply the 
brute-force method of computing the probability functions in turn for all positions 
in time contained in a line segment (cf. Section 3.4) and to check whether at least 
30% of the probability is concentrated within area A. Fig. 12 shows two positions 
in time, P1 and P2, and their respective sampling errors (depicted by dotted 
lines). For each of the positions, the probability concentrated within the query 
window is depicted by a shaded area. The set of solutions after the refinement step 
comprises all positions in time whose positional probability within a given query 
window is at least as high as specified in the query. 

Fig. 12. Refinement step 
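
The brute-force refinement can be sketched as a scan over sampled time instants of one segment. The sketch assumes that the sampling error at time t is uniformly distributed over the lens formed by the two reachability circles around the measured end positions; the grid-based area estimate, the function names, and the default resolutions are illustrative choices rather than the paper's procedure.

```python
def lens_share_in_window(p1, t1, p2, t2, v_max, t, window, n=60):
    """Estimate the share of the lens at time t (t1 < t < t2) inside the window.

    The lens is the intersection of the disc around p1 with radius
    v_max * (t - t1) and the disc around p2 with radius v_max * (t2 - t).
    An n x n grid over the lens's bounding box approximates both areas.
    window = (x_min, x_max, y_min, y_max).
    """
    x_min, x_max, y_min, y_max = window
    r1, r2 = v_max * (t - t1), v_max * (t2 - t)
    bx0, bx1 = max(p1[0] - r1, p2[0] - r2), min(p1[0] + r1, p2[0] + r2)
    by0, by1 = max(p1[1] - r1, p2[1] - r2), min(p1[1] + r1, p2[1] + r2)
    if bx0 >= bx1 or by0 >= by1:
        return 0.0
    in_lens = in_both = 0
    for i in range(n):
        x = bx0 + (i + 0.5) * (bx1 - bx0) / n
        for j in range(n):
            y = by0 + (j + 0.5) * (by1 - by0) / n
            if ((x - p1[0]) ** 2 + (y - p1[1]) ** 2 <= r1 * r1
                    and (x - p2[0]) ** 2 + (y - p2[1]) ** 2 <= r2 * r2):
                in_lens += 1
                if x_min <= x <= x_max and y_min <= y <= y_max:
                    in_both += 1
    return in_both / in_lens if in_lens else 0.0


def refine(p1, t1, p2, t2, v_max, window, p_min, steps=20):
    """Return the sampled interior times whose position qualifies for the query."""
    qualifying = []
    for k in range(1, steps):
        t = t1 + (t2 - t1) * k / steps
        if lens_share_in_window(p1, t1, p2, t2, v_max, t, window) >= p_min:
            qualifying.append(t)
    return qualifying


# Hypothetical segment: (0, 0) at time 0 to (100, 0) at time 10, v_max = 15.
times = refine((0, 0), 0, (100, 0), 10, 15, (50, 1000, -1000, 1000), 0.3)
print(times)
```

The measured end positions themselves are not scanned here; since the measurement error is disregarded, they can be tested against the window directly.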

On the Size of Query Windows Some query types deserve special attention 
within the presented framework. Point queries such as “Which taxis were in 
location A (point) at time B with 50% probability?” cannot be answered within 
this framework, since we cannot compute how much probability is concentrated 
within a point. 

However, some “point” queries actually might be translated into window 
queries, e.g., location A might refer to a road crossing or a waiting area for 
taxis. In this case we are confronted with a small-window query. 

Consider the above query where the query window of location A has an extent 
of, e.g., a taxi stand. If the sampling rate of the taxis’ positions was very coarse, 
the positions have a high degree of uncertainty, and the sampling error is very 
large. To find an answer to our query, we have to determine the positions for 
which at least 50% of the probability is concentrated within the query window. 
If the query window is too small with respect to the error measure, no positions 
will qualify, e.g., consider query window QW1 in Fig. 13. 

On Non-Empty Query Results To derive a first minimum size of a query window 
for which the result is not guaranteed to be empty, we assume the worst case 
for both error and query. The largest possible error is the worst-case sampling 
error. The worst case of a query is to specify 100% probability, e.g., “Which 
taxis were in location A (point) at time B with 100% probability?” The smallest 
square query window we can consider has side length 2r, the diameter of the 
worst-case sampling error, e.g., QW2 shown in Fig. 13. If the query window is 
smaller, the probability of the worst-case sampling error cannot be contained 
entirely within the query window any more, i.e., less than 100% probability is 
concentrated within the query window, and the result is guaranteed to be empty. 




Fig. 13. Small-window query 



However, this only depicts the worst-case scenario. For query windows with 
side length larger than 2r, queries specifying varying degrees of uncertainty may 
have non-empty answers. If the query window is smaller and the query specifies 
100% probability, certain positions are eliminated because their associated error 
is too large for them to be for certain (100%) in the query window. Although 
this seems only to eliminate solutions, it has significant consequences for the 
use of the worst-case error with the query window in the filter step, as will be 
explained next. 



The Filtering Step Revisited If the query window is smaller than 2r, 
the filtering as outlined earlier might eliminate positions that satisfy the query 
predicate. Consider the example shown in Fig. 14(a). First, we determine the 
expansion measure for a probability of 30% and the query window (the 
rectangle). The shaded area symbolizes the intersection of the error measure and 
the query window. The size of the area corresponds to the positional probability 
concentrated within the query window. 

Using this expansion measure, however, would exclude qualifying positions, 
e.g., position P would be discarded in the filter step, although 30% or more 
of its actual positional probability (dotted lens shape of the sampling error) is 
concentrated within the query window. 

To avoid the elimination of qualifying positions, we will initially expand the 
sides of small query windows that are smaller than 2r to be of size 2r and 
then use the resulting window to determine the expansion measure as described 
earlier in this section. This is illustrated in Fig. 14(b), where the probability 
concentrated in the window of height 2r is symbolized by shading, and where 
the expansion measure is symbolized by the longer arrow. With this measure, 
position P will be in the set of candidate solutions. 

To recap, small-window queries are addressed as follows. A query window can be 
arbitrarily small. In connection with databases considering spatial uncertainty, 
the size of the query window is also determined by the uncertainty specified in 
the query. Further, the “extent” of a spatial position stored in the database is 
determined by the associated error measure. Consequently, when specifying a 
query, one has to keep all these measures in mind so as not to retrieve an 
unintentionally small result. 




(a) 



(b) 



Fig. 14. Query window expansion: small query window 



5.3 Summary of Approach 

Our goal for this section was to give a method for using a moving-point index 
to process queries of the form “Retrieve the moving-object positions that were 
inside query rectangle A at some time between times B and C with a probability 
of at least X%.” The trajectories are indexed using a moving-point index that 
supports range queries. A superset of the qualifying positions is retrieved in a 
filtering step, in which we expand the query window to retrieve all line segments 
containing positions that are in the query window with probability at least X%. 
The expansion is determined using probability X and the worst-case sampling 
error, stored in the database. 

In the refinement step, the positions contained in the retrieved line segments 
that actually are within query rectangle A with probability at least X% are 
identified. Here we use the sampling error, which is distinct for all positions. The 
following pseudo-algorithm summarizes the full retrieval procedure. We assume 
that the diameter of the worst-case sampling error is 2r. 

IF query window A has either height or width less than 2r THEN 
    Increase the smaller side(s) to be of size 2r; 
IF X < 50% THEN 
    Apply query window expansion to A; 
Let S contain the result of searching the index with A; 
Apply refinement to the line segments in S and return the resulting points; 
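
The pseudo-algorithm can be mirrored in a few lines of Python. This is a sketch under several assumptions: a linear scan with a bounding-box test stands in for the moving-point index, small windows are enlarged symmetrically around their center, the expansion measure and the refinement are supplied by the caller as functions, and refinement is evaluated against the original query rectangle. All names are hypothetical.

```python
def prepare_window(window, p, r, expansion):
    """Enlarge sides shorter than 2r, then expand the window if p < 50%.

    window = (x_min, x_max, y_min, y_max); p is a probability in (0, 1];
    expansion(p, r) returns the expansion measure for probability p.
    """
    x_min, x_max, y_min, y_max = window
    if x_max - x_min < 2 * r:  # width below 2r: grow around the center
        cx = (x_min + x_max) / 2
        x_min, x_max = cx - r, cx + r
    if y_max - y_min < 2 * r:  # height below 2r: grow around the center
        cy = (y_min + y_max) / 2
        y_min, y_max = cy - r, cy + r
    if p < 0.5:  # no expansion (and no shrinking) at 50% or above
        d = expansion(p, r)
        x_min, x_max, y_min, y_max = x_min - d, x_max + d, y_min - d, y_max + d
    return x_min, x_max, y_min, y_max


def bbox_overlaps(segment, window):
    """Conservative test: does the segment's bounding box meet the window?"""
    x1, y1, x2, y2 = segment
    x_min, x_max, y_min, y_max = window
    return (min(x1, x2) <= x_max and max(x1, x2) >= x_min
            and min(y1, y2) <= y_max and max(y1, y2) >= y_min)


def retrieve(segments, window, p, r, expansion, refine):
    """Filter with the prepared window, then refine against the original one."""
    search_window = prepare_window(window, p, r, expansion)
    candidates = [s for s in segments if bbox_overlaps(s, search_window)]
    return [pt for s in candidates for pt in refine(s, window, p)]


# Toy run with a constant expansion measure and a trivial refinement stand-in.
segments = [(0, 0, 5, 5), (500, 500, 600, 600)]
result = retrieve(segments, (0, 10, 0, 10), 0.3, 10,
                  expansion=lambda p, r: 3.0,
                  refine=lambda s, w, p: [s[:2]])
print(result)
```

Because the filter only has to return a superset, the bounding-box test may admit segments that merely pass near the window; the refinement step removes them.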



6 Conclusions and Future Research 

The paper investigates the representation of moving-point objects in databases. 
First, a set of queries derived from requirements to an application managing 
moving-point objects is presented. The Global Positioning System is the tech- 
nology used for obtaining samples of the positions of these objects. 

The paper proposes a method for acquiring and representing the movements 
of point objects. The positions of objects are sampled at selected points in time, 
and the positions in-between these points in time are obtained using 
interpolation, thus capturing the complete movement. 

The representation of movements is inherently imprecise, and the paper con- 
siders two types of errors, the measurement error and the sampling error. Two 
measures were derived for the sampling error, one pertaining to each position in 
time, and a global worst-case error. It is further shown that the measurement 
error can be ignored in the application context considered. 

A database schema is proposed that incorporates both the polyline represen- 
tation of movements as well as the parameters of the various error distributions 
associated with the polyline representation. The schema is illustrated by an ex- 
ample database suited for the taxi management application. 

Finally, the paper shows how to use this database to answer spatiotemporal 
queries derived from the example application. The error information is used 
in connection with an arbitrary moving-point index to answer spatiotemporal 
queries using the standard filter-and-refinement process. 

This work points to several directions for future research. First, for the repre- 
sentation of the movement, we chose to linearly interpolate in-between measured 
positions. More advanced techniques may be used for this purpose as well, e.g., 
polynomial splines. Second, two types of error measures were considered, namely 
the measurement error and the sampling error. Additionally, manipulating the 
measured positions before storing them in the database introduces another error 
that needs to be considered. Thus, generalizing the approach to an arbitrary 
number of error measures poses an interesting challenge. Third, in our work 
we only consider uncertainty in the spatial dimensions (cf. Section 3.2). This 
is partly because of the high precision with respect to time of the positional 
measurement device, GPS, we use. Using motion sensors or other techniques 
instead poses the question of quantifying uncertainty with respect to time as 
well. Fourth, one of the underlying assumptions in our work is that objects are 
not restricted in their movements through space. In reality, the space considered 
will typically contain roads, railroad tracks, walls, floors, mountains, lakes, or 
other “infrastructure” that facilitate or inhibit movement. This infrastructure 
may be taken into account to yield a reduced overall uncertainty and error in 
the database, as well as other benefits. 

Abstract. Intelligent Mobile Information Systems support information- 
centered applications that require support for a large number of dis- 
tributed mobile users collaborating on a common mission and with in- 
terests in a common situation domain. A mobile user operating in the 
field changes location, consumes resources, investigates situations “on 
the horizon,” and performs other incrementally evolving activities. A 
mobile user’s information needs are therefore continually evolving in a 
neighborhood of interrelated data centered on the user’s current location. 
Broadcast data dissemination is most effective when each broadcast in- 
formation packet has multiple interested parties. To maximize the value 
of multicast dissemination, we dynamically cluster similar user profiles 
into aggregate user classifications that are served by independent multi- 
cast channels of custom information packets. Mobile user locations are 
also continuously tracked and mapped onto a cartographic representation 
of the real scenario. Spatial proximity between users is then computed 
by taking into account real boundaries as described in the cartographic 
map. Spatial information and spatial relationships among mobile users 
are then provided to the clustering algorithm with an eventual improved 
quality of the disseminated data. 



1 Introduction 

Intelligent Mobile Information Systems [19,18] support information-centered 
applications that require support for a large number of distributed mobile users 
collaborating on a common mission and with interests in common situation 
domains. Most existing work focuses on access methods to spatial databases that 
contain pre-recorded situation information. On the other hand, we aim to 
provide support for proactive and adaptive multicast-based dissemination of 
real-time situation information. The underlying basis behind creating a multicast 
service for sharing mobile information is to efficiently disseminate relevant 
information to users that are in close proximity, and hence likely share a common 
interest in transportation information. 

[...] (DARPA) under the Battlefield Awareness Data Dissemination (BADD) Program. 
The view and conclusions contained in this document are those of the authors, and [...] 

A mobile user operating in the field changes location, consumes resources, in- 
vestigates situations “on the horizon,” and performs other incrementally evolving 
activities. A mobile user’s information needs are therefore continually evolving 
in a neighborhood of interrelated data centered on the user’s current location. 
This need can be effectively satisfied by listening to the broadcast of situation 
updates by fixed stations as well as other mobile users in the user’s neighborhood. 
Broadcast data dissemination is most effective when each broadcast information 
packet has multiple interested, or receiving, parties. To maximize the value of 
multicast dissemination, we dynamically cluster similar user profiles into aggre- 
gate group profiles that are served by independent multicast channels of custom 
information packets. In other words, given a collection of information interests 
of mobile users against a collection of information streams, a clustering process 
is performed to organize users into disjoint groups. Multicast communication 
channels are then created for user groups whose membership changes as mobile 
users migrate. Through the multicast channel assigned to the group he/she be- 
longs to, a user can make announcements to his/her fellow group members and 
obtain information from sources servicing the area covered by the group. 

One of the most important issues in realizing the information dissemination 
framework for mobile information sharing and dissemination is to balance the 
often antagonistic goals of maintaining accurate clustering, so that users receive 
highly relevant information, and economical utilization of network bandwidth. At 
the same time, care has to be taken to accommodate constraints imposed by 
mobile wireless networks that form the communication infrastructure. In 
particular, existing terrestrial wireless networks have limited bandwidth and 
support only a limited number of allowable multicast channels (if multicasting 
is supported at all). Such a framework is further complicated by the mobile nature 
of users, who move around in the real world while participating in a multicast 
session. This implies that users’ interests, correlated to their current geographical 
location, vary across space and time. As a result, the clustering has to be 
continuously modified to maintain its quality. Furthermore, mobile users’ 
movements are generally constrained in the real world by natural and/or artificial 
boundaries such as roads, rivers, tracks, etc. Such boundaries should be exploited 
to help guide decisions on whether users are “similar” and hence improve the 
quality of clustering. 

In this paper, we present a framework that combines knowledge discovery 
and spatial database techniques to realize effective mobile information sharing 
given the real-life limitations of existing mobile wireless networks. In Section 2, 
we describe the fundamental cartographic map representation and its usage to 
dynamically compute real-world distance among mobile users in the system. In 
Section 3, we describe an algorithm to dynamically cluster mobile users based 
on their spatial domain of interest and compare it with previous clustering 
algorithms that have appeared in the literature. Users’ locations are continuously 
mapped onto the cartographic representation of the actual scenario. Sections 4 
and 5 present simulation results and the current prototype implementation of our 
intelligent mobile information system, respectively. Finally, Section 6 concludes 
the paper by summarizing the contributions of this research. 



2 Geographic Information Modeling 

As already mentioned, in many circumstances users move around in the 
real world following well-defined natural and/or artificial routes such as roads, 
rivers, tracks, air channels, etc. Often spatial clustering has been solely based 
on Euclidean distance among moving entities. This measure may be ineffective 
in real-world situations, where the actual distance between pairs of entities 
moving along defined paths rarely coincides with their Euclidean distance. For 
instance, in the accompanying figure user a is moving along the same route as b, 
while c is moving on another route. According to the actual road distance, a is 
closer to b than to c (even though the Euclidean distance computation would 
suggest the opposite); therefore, a would more likely be clustered together with b, 
since his/her interest in, say, road information may be closer to b’s. 




Identifying a user’s location on a cartographic map, and relating this location 
to those of other users, improves user profile prediction, so that more relevant 
information is subsequently multicast to specific users. In order to include 
spatial information in our profile clustering algorithm, we superimposed 
a cartographic representation of the real scenario onto our system. Mobile users’ 
locations are then mapped onto this cartography, so that a richer set 
of information is provided to the clustering algorithm. 

Hence, in our system user profiles are augmented with information such 
as the user’s location, movement direction, and speed. Furthermore, distance 
among entities is the real-world physical distance based on the map representation, 
taking into account the actual shape and length of real routes. These variables 
are then included in the objective function of the profile clustering algorithm 
discussed above. 

2.1 Cartographic Map Representation 

In our system, users’ locations are continuously mapped onto a cartographic 
map. Distance among moving objects corresponds to their physical distance 
computed in the map by following the real routes reproduced in the map. Hence, 
we assume that entities move along such routes and that these routes are properly 
represented in the cartography at hand. Routes in the map are represented as 
polylines in a vector-based representation of a GIS. Each polyline is described by 
the set of its vertex coordinate pairs: (<x_0, y_0>, <x_1, y_1>, ..., <x_n, y_n>). 

The cartography needs to be modeled by a data structure that can guarantee 
an efficient computation of the distance between each pair of objects (the 
spatial clustering objective function is based upon such a distance). To satisfy 
this requirement we adopt a modified version of a spatial network [17] 
(alternatively called Spatial Graph). Spatial networks have been exploited in many 
different domains as the kernel data representation; examples of these domains 
are transportation systems, air traffic management, urban management, as well 
as all the different types of utility networking such as power, telephone, water, 
and gas. A spatial network is a richly connected graph where nodes are labeled 
with a geographical (x, y) location of the specific entity to be modeled, and edges 
link pairs of entities. As an example, a spatial graph for a road transportation 
system would be modeled by having a node for each road intersection and an 
edge for each road segment connecting two intersections. Nodes would be 
labeled with the Euclidean (x, y) coordinates of the intersection, while edges would 
be associated with various information depending upon the target application 
(e.g., length of the road segment, name of the street, type of street, etc.). Note 
that spatial networks are also suitable to model three-dimensional spaces. 

In order to fully capture the dynamic nature of our mobile user based applica- 
tion we modified the standard spatial network model into a finer representation. 
In our system, a spatial graph G is created by a two-step procedure as described 
in the following. First, apply the following to each polyline 
p = (<x_0, y_0>, <x_1, y_1>, ..., <x_n, y_n>) in the input map: 

1. Create a class k_0 in G and label k_0 with “<x_0, y_0>”; 

2. For each vertex <x_j, y_j> (j = 1, 2, ..., n) of p: 

(a) Create a new class k_j and label it with “<x_j, y_j>”; 

(b) Create an edge e connecting k_j to k_{j-1} and label e with the Euclidean 
distance between <x_j, y_j> and <x_{j-1}, y_{j-1}>. 

Once the graph G is created apply the following recursive procedure to G. 

1. While there is an edge e of G, connecting the classes k_1 and k_2, such that 
e.label > MAX_EDGE_LENGTH (where MAX_EDGE_LENGTH is a user-defined 
threshold), do the following: 

(a) Remove e from G; 

(b) Compute <x_m, y_m> as the mid-point coordinates of the two locations 
identified by (the labels of) k_1 and k_2; 

(c) Create a new class k_m with label “<x_m, y_m>”; 

(d) Create a new edge e_1 connecting k_1 to k_m and label it with e.label / 2; 

(e) Create a new edge e_2 connecting k_m to k_2 and label it with e.label / 2. 
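
As an illustration only, the two-step procedure above can be sketched in Python. The dictionary-based graph layout, the function names `build_graph` and `subdivide`, and the numeric value of `MAX_EDGE_LENGTH` are assumptions made for this sketch; the paper does not specify an implementation, and the step that joins the per-polyline graphs at road intersections is not covered here.

```python
import math
from itertools import count

MAX_EDGE_LENGTH = 10.0  # illustrative value; the paper leaves it to the user


def build_graph(polyline):
    """Step 1: one node per polyline vertex, one edge per consecutive pair.

    Returns (labels, edges): labels maps node id -> (x, y); edges maps
    frozenset({u, v}) -> edge label (Euclidean length of the segment).
    """
    labels = {i: pt for i, pt in enumerate(polyline)}
    edges = {}
    for j in range(1, len(polyline)):
        edges[frozenset((j - 1, j))] = math.dist(polyline[j - 1], polyline[j])
    return labels, edges


def subdivide(labels, edges, max_len=MAX_EDGE_LENGTH):
    """Step 2: split every edge longer than max_len at its midpoint."""
    next_id = count(max(labels) + 1)
    pending = list(edges)
    while pending:
        key = pending.pop()
        length = edges[key]
        if length <= max_len:
            continue
        k1, k2 = tuple(key)
        (x1, y1), (x2, y2) = labels[k1], labels[k2]
        km = next(next_id)
        labels[km] = ((x1 + x2) / 2, (y1 + y2) / 2)  # midpoint node
        del edges[key]                               # (a) remove e
        for end in (k1, k2):                         # (d), (e) two half edges
            half = frozenset((end, km))
            edges[half] = length / 2
            pending.append(half)                     # halves may still be too long
    return labels, edges
```

For example, `subdivide(*build_graph([(0.0, 0.0), (40.0, 0.0)]))` splits the single 40-unit segment repeatedly until every edge is at most `MAX_EDGE_LENGTH` long.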

This procedure creates a set of spatial graphs: one for each polyline in the 
input map. This set of graphs then needs to be combined into a single large 
fully connected spatial graph in which road intersections are properly represented. 
Figure El depicts two polylines crossing each other: circles in the figure represent 
nodes in the spatial graph, notice that nodes exist for each vertex in a road as 
well as for the intersection. 

In essence, our spatial graph has the following properties: 

1. A node for each route vertex (node a in Figure EI); 

2. A node for each road intersection (node b in Figure EI); 

3. Intermediate nodes are created for long straight road segments in order 
to make sure that each segment is shorter than a certain threshold (i.e., 
MAX_EDGE_LENGTH) defined by the user (node c in Figure E). 

4. Edges connect adjacent nodes if a road segment exists connecting the two 
corresponding geographical locations. 

A spatial graph in our system is basically an approximation of the coordinate 
space in the input map for all the points belonging to a route in the map. The 
approximation ratio is controlled by the parameter MAX_EDGE_LENGTH 
which is set by the user. The value of MAX_EDGE_LENGTH depends upon 
the current application and the accuracy of the related equipment (e.g., GPS 
tolerance). 

At each refresh point, a mobile user’s location is mapped to a particular 
class of the spatial graph. Henceforth, user’s movement history is described by 
the sequence of classes visited by the user. 




2.2 Modeling Interest Profiles of Mobile Users 

A simple way of modeling an interest in information pertaining to a geographic 
region is to use a rectangular bounding box. While convenient, information re- 
quests modeled as rectangular bounding boxes can be inaccurate, especially for 
that of a group of users clustered together. We address the problem of the inefficiencies 
in using rectangular bounding boxes for clustering entities by marking 
off spatial coverage using grids. The entire playing area is divided into a rect- 
angular grid and the geographic area of interest of each mobile user lies in its 
neighborhood along some path. 
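
To make the contrast between the two models concrete, the following sketch (not taken from the paper) counts the grid cells near an L-shaped route and compares that with the number of cells in the rectangular bounding box of the same coverage. The integer cell indexing, the `radius` parameter, and the route itself are assumptions chosen for illustration.

```python
def cells_along_path(path, radius=1):
    """Grid cells within `radius` cells (in x and in y) of any cell on the path."""
    covered = set()
    for cx, cy in path:
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                covered.add((cx + dx, cy + dy))
    return covered


def bounding_box_cells(cells):
    """Number of grid cells in the axis-aligned bounding box of `cells`."""
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


# An L-shaped route: 10 cells heading east, then 10 cells heading north.
route = [(x, 0) for x in range(10)] + [(9, y) for y in range(1, 11)]
grid_cover = cells_along_path(route, radius=1)
print(len(grid_cover), bounding_box_cells(grid_cover))
```

For this route the grid coverage holds well under half of the cells that the bounding box would claim, which is the kind of inaccuracy the grid model is meant to avoid.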

Once the spatial graph is created it can be used at run-time to dynamically 
compute distance among mobile users. At each refresh point, coordinates of mov- 
ing entities are mapped into classes of the spatial graph G. That is, each entity 
location is mapped to the closest point described by a class of G. A neighbor- 
hood of a mobile user is the set of classes in the spatial graph that are within a 
threshold distance from the user’s location at a specific time. In Figure 0 three 
objects (represented by a circle) and their respective neighborhoods (represented 
by the set of small squares) are depicted. Neighborhoods can easily be computed 
on the spatial graph by a linear time graph visit algorithm dni. 




Vicinity among users is then defined based on the number of overlapping 
classes in their respective neighborhoods: the more classes overlap between the 
two neighborhoods, the (spatially) closer the two users are. The adopted graph 
based representation discussed above allows an easy and efficient computation of 
the vicinity dimension which is then used by the profile clustering algorithm. The 
adopted spatial graph representation is suitable to compute distance between all 
possible pairs of mobile users by exploiting a conventional graph shortest path 
algorithm m- However, empirical results have shown us that due to the large 
number of moving objects to be modeled, computing the actual distance between 
all possible pairs would be too costly for a dynamic real-time application. 
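
A minimal sketch of the neighborhood and vicinity computations described above, under stated assumptions: the graph is an adjacency dictionary with integer node ids, and a distance-bounded Dijkstra search stands in for the graph visit algorithm, which the text does not detail. The names `neighborhood` and `vicinity` are hypothetical.

```python
import heapq


def neighborhood(adj, start, threshold):
    """Nodes whose network distance from `start` is at most `threshold`.

    adj maps node -> list of (neighbor, edge_length).
    """
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            # Nodes beyond the threshold are never entered into the search.
            if nd <= threshold and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return set(dist)


def vicinity(adj, node_a, node_b, threshold):
    """Number of graph nodes shared by the two users' neighborhoods."""
    shared = neighborhood(adj, node_a, threshold) & neighborhood(adj, node_b, threshold)
    return len(shared)
```

Because the search stops at the threshold distance, its cost depends on the size of the neighborhood rather than on the whole graph, which is what makes the overlap count cheaper than all-pairs shortest paths.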



3 Dynamic Profile Clustering 

A real-time mobile information dissemination environment generally consists of 
the mobile users with their specialized information requests and the many in- 
formation streams created by the users and stationary data collection stations 
updating the situation in the field. Profile clustering is the process of incrementally 
clustering user information requests into a collection of group requests each 
covering all of its member users’ requests. 

Profile clustering is a conceptual clustering problem that has been studied 
extensively by the artificial intelligence community as a means to extract hid- 
den inference information [^, and by the information retrieval community as 
a method to improve the quality of query results |5|. Clustering systems algo- 
rithms (e.g., COBWEB |5|, CLUSTER/2 [ig, and AUTOCLASS 0) share the 
common goal of discovering structure in data but differ in the objective function 
used to evaluate clustering quality and the control strategy used to search the 
space of clusterings. 

Our profile clustering algorithm for effective information 
dissemination differs from existing conceptual clustering algorithms in that its goal 
is more than optimizing the accuracy of the clustering. In incremental profile 
clustering, simplicity (a small number of groups) and stability (infrequent 
reassignment of user requests to groups) are also important. We desire a simple 
clustering because networks generally have a practical limit on the number of 
multicast channels that can be supported. At the same time, it is important for 
the profile clustering not to change frequently, because a large number of users 
are potentially interrupted when the clustering changes, and re-materializing a 
changed clustering in the form of new multicast channels can be an expensive 
process. Our algorithm alleviates the problem of instability by sacrificing some 
of the optimality in the clustering through the use of a request group cover 
diameter tolerance. In addition, we allow the number of groups to be bounded, 
since it translates to the number of multicast channels, which is limited in a 
real network. 

3.1 Static Clustering Framework 

The submission, update, or cancellation of user requests triggers the profile clus- 
tering process. Given the needs of profile clustering for intelligent mobile infor- 
mation sharing, the algorithm to reconfigure an existing profile clustering when 
a new request is submitted consists of 3 main steps: 

1. Try to find an existing group in the clustering that closely covers 
the request. A profile group covers a request if the group request completely 
subsumes the request. A threshold, COVER_THRESHOLD, is defined as a 
means to control the inaccuracy of clustering that can be tolerated. A request 
is assigned to a group that completely covers it only if the difference between 
their coverages does not exceed the threshold. For example, if the threshold 
equals 0, a request is assigned to an existing group only if they have exactly 
the same coverage. On the other hand, if the threshold is 1, a request can 
be assigned to any existing group in the profile clustering. 

2. Try to expand the existing group that is the closest in coverage to the 
request. A threshold, EXPAND_THRESHOLD, is defined to control the 
extent of expansion in a group’s coverage that is allowed. Specifically, the 
group that is closest to the request in coverage cannot be expanded to include 
the request if the increase in the group’s coverage from its original coverage 
is larger than the threshold. If the threshold is 0, no group is allowed to 
expand in coverage. On the other hand, groups can be expanded arbitrarily 
if the threshold is 1. The threshold introduces a tunable means of allowing 
inaccuracy in clustering as a tradeoff for profile clustering stability. 

3. If all else fails, generate a new profile group that covers the user 
request. The new group will serve as a new locus of clustering to which new 
user requests are attracted. Moreover, it may later be merged with other 
groups as its coverage migrates due to changes in its membership and the 
coverage of its members. 
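
The three steps above can be sketched as follows. This is a hedged illustration, not the authors' implementation: coverages are assumed to be non-empty sets of grid cells, the coverage difference in step 1 is assumed to be the fraction of a group's cells not requested, the expansion in step 2 is assumed to be the fraction of new cells in the grown coverage, and "closest in coverage" is assumed to mean the smallest symmetric difference. The function name `assign_request` is hypothetical.

```python
COVER_THRESHOLD = 0.1   # values reported for the experiments in Section 4
EXPAND_THRESHOLD = 0.1


def assign_request(groups, request,
                   cover_t=COVER_THRESHOLD, expand_t=EXPAND_THRESHOLD):
    """Assign `request` (a set of grid cells) to a group; return its index.

    `groups` is a list of cell sets and is modified in place.
    """
    # Step 1: a covering group whose coverage is close enough to the request.
    best, best_diff = None, None
    for i, g in enumerate(groups):
        if request <= g:
            diff = len(g - request) / len(g)
            if diff <= cover_t and (best is None or diff < best_diff):
                best, best_diff = i, diff
    if best is not None:
        return best

    # Step 2: expand the closest non-covering group if the growth is small.
    candidates = [k for k, g in enumerate(groups) if not request <= g]
    if candidates:
        i = min(candidates, key=lambda k: len(groups[k] ^ request))
        grown = groups[i] | request
        if len(grown - groups[i]) / len(grown) <= expand_t:
            groups[i] = grown
            return i

    # Step 3: open a new group that covers exactly this request.
    groups.append(set(request))
    return len(groups) - 1
```

With these definitions both ratios lie in [0, 1), so a threshold of 0 forbids any slack or expansion and a threshold of 1 permits all of it, matching the boundary cases described in the text.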

3.2 Adaptive Reclustering 

Once the objects are clustered, we try to preserve the original clusters as far as 
possible - as objects move or require new information, the group they are in is 
modified to reflect these changes. However, over a period of time this may lead 
to improper clustering of entities. The entities in a particular group may drift 
far enough apart from each other that the coverage area of the group becomes 
undesirably large. Also, two clusters may change their coverage such that their 
information coverage overlaps to an extent that it would be justified to merge 
the two clusters into a single cluster. 

We tackle the dynamics of user profiles by using a two-phased approach 
towards maintaining group assignments. Periodically, the assignment of entities 
to clusters is “reviewed”. This review consists of two phases: a splitting phase 
and a merging phase. 

— Splitting Phase. In the splitting phase, each group and the entities in it 
are examined and if the entities in the group have moved significantly away 
from each other, the group is split into multiple smaller clusters. We tackle 
the task of deciding whether a group should be split into multiple clusters by 
reclustering the entities in the group within themselves. If the entities in the 
group should logically belong to a single cluster, the reclustering will result 
in a single cluster. If, on the other hand the entities in the group have moved 
far apart from each other since the last time the clusters were reviewed, so 
much so that their being in the same group cannot be justified, the result of 
the reclustering will be two or more smaller clusters consisting of subsets of 
the entities in the original cluster. 

Reclustering the entities within a group is less expensive than reclustering 
all entities because the number of entities being reclustered is much smaller than 
the total number of entities in the system. Within a group, the first entity picked 
should be one on an extreme corner of the space defined by the other entities (the 
one on the bottom left corner, say). If there is no single entity at the corner, an 
entity closest to one of the corners may be chosen as the first one to cluster. All 
other entities in the system should be ranked in order of increasing distance from 
the entity we start with. The iterative procedure cycles through the entities 
in this order. 

— Merging Phase. In the merging phase, the clusters (including the subclusters 
generated by the splitting phase) are examined. If two or more clusters overlap to 
a sufficient degree that it is justifiable to merge them into a single cluster, 
they are merged into one. Considering the merger of one group into another 
is similar to merging an entity into a group, as described earlier. We compute the 
ratio of the number of cells in the spatial coverage of the group to be merged that 
are not covered by the group it is being merged into, to the total number of cells 
covered by the group to be merged. If this ratio is less than the 
parameter GROUP_MERGE_THRESHOLD, the clusters can be merged, 
based upon the spatial coverage attribute. In other words, as the threshold 
increases, the less overlap two groups need in order to be merged and 
the more relaxed the merging criterion is. 
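
The merging criterion on spatial coverage can be written down directly. In this sketch, coverages are assumed to be non-empty sets of grid cells, the function names are hypothetical, and the default threshold of 0.55 is only the example value discussed in the experiments.

```python
GROUP_MERGE_THRESHOLD = 0.55  # illustrative default


def uncovered_ratio(src_cells, dst_cells):
    """Fraction of the cells of the group to be merged that the target lacks."""
    return len(src_cells - dst_cells) / len(src_cells)


def can_merge(src_cells, dst_cells, threshold=GROUP_MERGE_THRESHOLD):
    """True when the spatial coverage attribute allows merging src into dst."""
    return uncovered_ratio(src_cells, dst_cells) < threshold
```

Note that the ratio is not symmetric: merging a small group into a large one that mostly contains it yields a low ratio, while the reverse direction may not.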



4 Experiment Results 

The profile clustering algorithm contains a number of parameters that can be 
adjusted to tune the balance between the overhead, accuracy, and simplicity of 
the clustering generated. They include the reclustering frequency, the cluster 
merging threshold, and the group joining threshold. One of the most important 
parameters that can be used to adjust the quality of clustering is the group merg- 
ing threshold that controls the allowable overlap between overlapping groups 
before they are combined into an aggregated group. In this section, we will show 
some simulation results and discuss the effects of changing one of the parameters, 
namely the cluster merging threshold, on mobile information dissemination. 

Given a fixed COVER_THRESHOLD of 0.1, an EXPAND_THRESHOLD of 
0.1, and a reclustering period of 10 seconds, Figure 4 shows a plot of the average 
number of groups for different group merging thresholds. The simulated user 
environment consists of 30 vehicles following random paths in the road grid shown 
in Figure Q. As expected, the number of groups decreases as the group merging 
threshold increases, since groups are allowed to be merged with a decreasing 
amount of overlap. More interestingly, since each group maps into a multicast 
channel in our information dissemination system, a limit on the number of 
allowed multicast channels in the networking infrastructure imposes a lower bound 
on the group merging threshold. For example, if the network infrastructure 
limits the number of multicast channels to 12, then the group merging threshold 
should be set to at least 0.55, given that all the other parameters remain the 
same. 

Figure 5 shows a plot of the total area covered by the groups for different 
group merging thresholds. In this case, a group’s coverage includes the coverages 
of its members and all 5x5-unit grid cells along the map graph connecting them. 
The total coverage size increases as the group merging threshold increases and the 
group count decreases, since more clustering causes more space “between” users to be 
included in groups. A group merging threshold of 0.55, as dictated by a multicast 
channel limit of 12, results in a total group coverage of 870 units. Assuming that 
approximately the same amount of data is being generated by mobile users 

Fig. 4. Plot of Group Count against Group Merging Threshold 



and data collection centers to describe the situation at each area, a plot showing 
the network bandwidth requirement will have the same shape as Figure 5 
for varying group merging thresholds. This information can be used, together with 
the multicast channel limit, to determine the appropriate group merging threshold 
for a given network bandwidth availability and allocation. 




Fig. 5. Plot of Total Group Area against Group Merging Threshold Using Grid- 
Based Groups 



Similar to Figure 5, Figure 6 also shows a plot of the total area covered 
by the groups for different group merging thresholds using the same clustering 
parameters. The difference is that in this case, the spatial coverage of a group 
is the rectangular bounding box of its members’ coverage. A group merging 
threshold of 0.55, as dictated by a multicast channel limit of 12, results in a total 
group coverage of 7280 units, which is much higher than that achieved when 
group profiles are modeled as a collection of grids. Interestingly, the total group 
coverage increases much faster as the merging criterion is loosened before leveling 
off. Together, the two facts indicate that bounding boxes are ineffective as a 
model for user groups, even for lightly clustered environments. 




Fig. 6. Plot of Total Group Area against Group Merging Threshold Using 
Bounding Boxes 



5 Mobile Information Dissemination System Prototype 

We have implemented a prototype of the intelligent mobile information system 
described in this paper in a hybrid satellite-terrestrial networking environment ^ 
that combines high-bandwidth satellite networking services (e.g., DirecPC) and 
terrestrial wireless networks d to support rapid deployment, user mobility, and 
wide-area support. It supports the dissemination and maintenance of extended 
situation awareness throughout such a network information infrastructure by 
means of adaptive reliable multicast-based information dissemination control 
from a clustering of mobile users. 

The scalable group communication model of IP multicast forms a natural 
basis for large-scale data dissemination. However, IP multicast is unreliable, and 
a reliable multicast-based group data distribution protocol has to be 
used to effectively support dissemination of important information that demands 
absolute delivery reliability. We adopted the Multicast Dissemination Protocol 
(MDP) jZIj for reliable multicast-based information dissemination. MDP is a 
receiver-reliable sender-oriented scheme that implements a NACK suppression 
algorithm: a receiver that recognizes the loss of a data item first listens to the 
multicast channel for a NACK, multicast by another receiver, that already requests 
re-multicast of that item. The algorithm allows NACKs to be amortized since 
multiple receivers often miss the same data items unless the data loss happens at 
the “last mile” to a particular receiver. MDP is particularly suitable for situation 
awareness environments in which users are resource-limited. 

Rather than modeling user interest based simply on current location, 
the profile of a mobile user includes the expected travel path of the user in 
addition to the spatial interest coverage around their location. The expected rate 
and direction of change in users’ geo-spatial interest domain serve as a basis for 
proactive information dissemination that anticipates users’ 
information needs and seamlessly fills each client’s cache with neighborhood, or 
likely to be relevant, situation information. In particular, the x- and y-velocity 
vectors are included as dimensions in the profile space for the purpose of 
clustering. The intuitive idea is to cluster entities moving in (more or less) the 
same direction together, and ones moving rapidly apart from each other in different 
groups, so that groups tend to be preserved for longer periods of time. 
By taking into account expected changes in user profiles, profile-oriented data 
dissemination achieves predictive information push that anticipates future user 
needs and minimizes the latency of data requests by making data available before 
they are explicitly requested. 

Even though the user profile captures expected temporal changes in cover- 
age of the interest domain, the actual future interest need may still differ. Plain 
dead-reckoning is a simple position update policy for maintaining an accurate user 
location despite deviation of the trajectory from the plan. A mobile user updates its 
position, and hence its information request, when it deviates from the expected 
position by more than some pre-defined threshold. From the user’s perspective, updating 
a request results in the possible reassignment of the request to a different group 
in the profile clustering. In addition, groups get merged, split, and deleted 
during profile reclustering even if mobile users do not do anything. A user is 
notified of changes when the profile group, and hence the multicast information 
channel, it is assigned to changes. While seemingly complicated, all the above 
interaction between a user and the information dissemination infrastructure is 
transparently handled by the client software that updates the user’s profile and 
provides a seamless transition between multicast channels when its channel as- 
signment changes. 
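
The plain dead-reckoning policy described above can be sketched as follows. Linear extrapolation from the last reported position and velocity is an assumption made for this sketch (the paper speaks more generally of expected travel paths), and the tuple layout and function names are hypothetical.

```python
import math


def expected_position(last_report, now):
    """Linear extrapolation from the last reported position and velocity.

    last_report is (x, y, vx, vy, t): position, velocity and report time.
    """
    x, y, vx, vy, t = last_report
    dt = now - t
    return (x + vx * dt, y + vy * dt)


def needs_update(last_report, actual_xy, now, threshold):
    """True when the user has drifted more than `threshold` from the prediction."""
    return math.dist(expected_position(last_report, now), actual_xy) > threshold
```

A client would call `needs_update` at each position fix and send a new report, and hence a new information request, only when it returns `True`.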

As a simulation and monitoring tool, we have implemented dynamic clustering 
monitor software that displays the changes in the grouping of mobile 
objects over time. Figure 7 shows a screen dump of the tool monitoring a run of 
the experiment described in the previous section. 

5.1 Applications 

Intelligent mobile information sharing has many real-life military, civil, and com- 
mercial applications. Battlefield situation awareness and emergency response are 
two important information-centered applications that require the capability to 
effectively disseminate multimedia information to large groups of geographically 
distributed mobile users collaborating on a common mission and with interests in a 
common situation domain. For example, after a natural disaster, emergency response 
teams and residents of affected areas need to share situation information 

Fig. 7. Screen Dump of Dynamic Clustering Monitor Software 



(e.g., road condition) as well as their up-to-date locations in order to coordinate 
evacuation; at the same time, they also receive a large variety of multimedia 
information from outside sources (e.g., real-time video from CNN, satellite sensor 
data from NASA, weather maps and forecasts) to enhance decision making. 
Shared situation awareness, during real-time mission execution, will be achieved 
by a hierarchical propagation of information throughout the operational organi- 
zation. To effectively adapt and react to rapidly evolving scenarios, units at all 
levels of command must perceive an extended awareness of the situation and of- 
ten act autonomously while remaining globally consistent in the overall mission 
objective. 

At the same time, future passenger vehicles will be equipped with networking 
facilities that allow their occupants to remain connected to the Internet while 
on the move. The value of advanced traveler information systems can be greatly 
increased by extending beyond providing simple traffic related information and 
guidance, to support scalable multimedia dissemination services to vehicles for 
informational (e.g., fleet management and communication) and entertainment 
(e.g., digital television services) purposes. 

6 Conclusions 

In this paper, we described the design of an intelligent mobile information system 
that runs on a hybrid satellite-wireless mobile network. It satisfies the unique 
information dissemination requirements of situation awareness and emergency 
response applications and effectively utilizes available bandwidth by implement- 
ing user profile aggregation that incrementally aggregates users into communities 
sharing common interests to enable multicast-based information dissemination. 

We achieve a higher quality of information disseminated to mobile users by 
continuously tracking their spatial location on a cartographic representation of the 
real scenario. Reasoning is then performed on this cartography in order to improve 
the clustering criteria by means of users’ spatial relationships computed 
on the real map. 

We have discussed the effects of tuning the group merging threshold on clus- 
tering and information dissemination performance. There are many more differ- 
ent dimensions of optimizing dynamic clustering under different environments 
and constraints. In addition to the threshold for joining groups that we have 
explored and discussed in the last section, we can also adjust other parameters 
of the dynamic clustering algorithm (e.g., temporal frequency of updating clus- 
ters, spatial “grid” size, frequency of splitting and merging clustering, etc.) for 
different system environments. 

Abstract. An efficient benchmarking environment for spatiotemporal access 
methods should at least include modules for generating synthetic datasets, 
storing datasets (real datasets included), collecting and running access 
structures, and visualizing experimental results. Focusing on the dataset 
repository module, a collection of synthetic data that would simulate a variety 
of real life scenarios is required. Several algorithms have been implemented in 
the past to generate static spatial (point or rectangular) data, for instance, 
following a predefined distribution in the workspace. However, by introducing 
motion, and thus temporal evolution in spatial object definition, generating 
synthetic data tends to be a complex problem. In this paper, we discuss the 
parameters to be considered by a generator for such type of data, and propose an 
algorithm, called “Generate_Spatio_Temporal_Data” (GSTD), which generates 
sets of moving point or rectangular data that follow an extended set of 
distributions. Some actual generated datasets are also presented. The GSTD 
source code and several illustrative examples are currently available to all 
researchers through the Internet. 



1. Introduction 

A field of ongoing research in the area of spatial databases and Geographical 
Information Systems (GIS) involves the accurate modeling of real geographical 
applications, i.e., applications that involve objects whose position, shape and size 
change over time. Real-world examples include storage and manipulation of 
trajectories, fire or hurricane front monitoring, simulators (e.g., flight simulators), 
weather forecasting, etc. 

Database Management Systems (DBMS) should be extended towards the efficient 
modeling and support of such novel applications. Towards this goal, recent research 
efforts have aimed at: 

- modeling and querying time-evolving spatial objects (e.g., [19, 3, 24]), 

- designing index structures and access methods (e.g., [13, 25]), 

- implementing appropriate architectures and systems (e.g., [26]). 

R.H. Güting, D. Papadias, F. Lochovsky (Eds.): SSD’99, LNCS 1651, pp. 147-164, 1999. 

In the recent literature, one can find work on formalization and modeling of 
spatiotemporal databases and a wide set of definitions about spatiotemporal objects. 
In the rest of the paper, we adopt the discrete definition for spatiotemporal objects 
that appears in [22]: 

Definition. A spatiotemporal object, identified by its o_id, is a time-evolving 
spatial object, i.e., its evolution (or ‘history’) is represented by a set of instances 
(o_id, s_i, t_i), where s_i is the location of object o_id at instant t_i (s_i and t_i are called 
spacestamp and timestamp, respectively). 

According to the above definition, a two-dimensional time-evolving point (region) 
is represented by a line (solid) in three-dimensional space. Figure 1 illustrates two 
examples: (a) a moving point and (b) a moving region, according to the terminology 
proposed by Erwig et al. in [3]. Although in the rest of the paper, we consider objects 
of dimensionality d = 2, the extension to higher dimensions is straightforward. 
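
As an informal illustration of the discrete definition, a moving point can be stored as a set of (o_id, spacestamp, timestamp) instances and read back as a polyline in three-dimensional (x, y, t) space. The class and function names below are hypothetical and are not part of the paper.

```python
from typing import NamedTuple, Tuple


class Instance(NamedTuple):
    o_id: int
    s: Tuple[float, float]  # spacestamp: location of the object at instant t
    t: float                # timestamp


def history(instances, o_id):
    """All instances of one object, ordered by timestamp."""
    return sorted((i for i in instances if i.o_id == o_id), key=lambda i: i.t)


def as_3d_polyline(instances, o_id):
    """A moving 2D point viewed as a line in (x, y, t) space."""
    return [(i.s[0], i.s[1], i.t) for i in history(instances, o_id)]
```

A moving region would carry a rectangle or polygon as its spacestamp instead of a single coordinate pair, giving a solid rather than a line in the same space.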





Fig. 1. Two-dimensional time-evolving spatial objects: (a) moving point, (b) moving region 

One of the tasks that a SpatioTemporal Database Management System (STDBMS) 
should definitely support is the efficient indexing and retrieval of 
spatiotemporal data. This task demands robust indexing techniques and fast access 
methods for a wide set of possible queries on spatiotemporal data. Either extensions 
of existing spatial access methods [27, 23, 13, 25] or new ‘from-scratch’ methods 
could be reasonable candidates. All proposals, however, should be evaluated under 
extensive experimentation on real and synthetic data. For instance, query processing 
and/or index building time (either real wall-clock time, or number of disk I/Os), space 
requirements and combinations thereof are all possible parameters against which one 
may want to evaluate a given index proposal. 

Overall, there is a lack of consistent performance comparison among the proposed 
approaches, with respect to the space occupied, the construction time, and the 
response time in order to answer a variety of spatial, temporal, and spatiotemporal 
queries. Moreover, Zobel et al. [28] suggest that “experiments of indexing techniques 
should be based on benchmarks such as standard sets of data and queries”. 



Popular examples of spatial datasets with dimensionality d > 2 include, among others, virtual 
reality worlds (d = 3) and feature-based image databases (usually d < 256). 

Following that guideline, the general architecture of a benchmarking environment 
for spatiotemporal access methods (STAMs) that is currently under design includes 
the following: 

a) a module that generates synthetic data and query sets, which would cover a variety 
of real life examples, 

b) a repository of real datasets (such as the extensively used TIGER files for - static - 
spatial data), 

c) a collection of STAMs for experimentation purposes, 

d) a database of experimental results, and 

e) a visualization tool able to display datasets and structures, for 
illustrative purposes. 

Our study continues an attempt towards a specification and classification scheme 
for STAMs initiated in [22]. Within the above framework, in this paper we 
concentrate on module (a) and, in particular: 

- discuss parameters that have to be taken into consideration for generating 
spatiotemporal datasets, and 

- propose an algorithm that generates datasets simulating a variety of scenarios with 
respect to user requirements. 

The rest of the paper is organized as follows: In Section 2 we discuss the 
motivation for this study. Section 3 discusses the parameters that need to be taken into 
consideration. An appropriate algorithm is presented in Section 4 together with 
example results and applications. Section 5 discusses several issues that arise and 
surveys related work. Finally, Section 6 concludes by also giving directions for future 
work. 



2. Motivation 

In the literature, several access methods have been proposed for spatial data without, 
however, taking the time aspect into consideration. Those methods are capable of 
manipulating geometric objects, such as points, rectangles, or even arbitrarily shaped 
objects (e.g., polygons). An exhaustive survey is found in [4]. On the other hand, 
temporal access methods have been proposed to index valid and/or transaction time, 
where space is not considered at all. A large family of access methods has been 
proposed to support multiversion / temporal data, by keeping track of data evolution 
over time (e.g., assume a database consisting of medical records, or employees’ 
salaries, or bank transactions, etc.). For a recent survey on temporal access methods 
see [17]. 

To the best of our knowledge, there is a very limited number of proposals that 
consider both spatial and temporal attributes of objects. In particular, MR-trees and 
RT-trees [27], 3D R-trees [23], and HR-trees [13] are based on the R-tree family [7, 1, 
9] while Overlapping Linear Quadtrees [25] are based on the Quadtree structure [18]. 
These approaches have the following characteristics: 

- 3D R-trees treat time as another dimension using a ‘state-of-the-art’ spatial 
indexing method, namely the R-tree, 
- MR-trees and HR-trees (respectively, Overlapping Linear Quadtrees) embed the 
concept of overlapping trees [12] into R-trees (Quadtrees) in order to represent 
successive states of the database, and 

- RT-trees couple time intervals with spatial ranges in each node of the tree structure 
by adopting ideas from TSB trees [11]. 

The majority of proposed spatiotemporal access structures are based on the R-tree 
(one exception is [25]); we therefore focus on such structures, and a short survey of 
the R-tree based approaches follows. 

Assuming time to be another dimension is a simple idea, since several tools for 
handling multidimensional data are already available [4]. The 3D R-tree implemented 
in [23] considers time as an extra dimension in the original two-dimensional space 
and transforms two-dimensional rectangles into three-dimensional boxes. Since the 
particular application considered in [23] (i.e., multimedia objects in an authoring 
environment) involves Minimum Bounding Rectangles (MBRs) that do not change 
their location through time, no dead space is introduced by their three-dimensional 
representation. However, if the above approach were used for moving objects, a lot of 
empty space would be introduced (Figure 2). 
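The dead space effect can be quantified with a small sketch. It assumes, purely for illustration, a w-by-h rectangle whose lower-left corner moves linearly between two positions during a time interval; the function names are hypothetical and not part of [23].

```python
def bounding_box_3d(x0, y0, x1, y1, w, h, t0, t1):
    """3D box (x, y, t) enclosing a w-by-h rectangle whose lower-left corner
    moves from (x0, y0) at time t0 to (x1, y1) at time t1."""
    return (min(x0, x1), min(y0, y1), t0,
            max(x0, x1) + w, max(y0, y1) + h, t1)


def volume(box):
    xl, yl, tl, xu, yu, tu = box
    return (xu - xl) * (yu - yl) * (tu - tl)


def dead_space_ratio(x0, y0, x1, y1, w, h, t0, t1):
    """Fraction of the 3D box that the moving rectangle never occupies."""
    occupied = w * h * (t1 - t0)  # every time slice has area w * h
    total = volume(bounding_box_3d(x0, y0, x1, y1, w, h, t0, t1))
    return 1.0 - occupied / total


static = dead_space_ratio(0.1, 0.1, 0.1, 0.1, 0.05, 0.05, 0.0, 1.0)
moving = dead_space_ratio(0.1, 0.1, 0.6, 0.6, 0.05, 0.05, 0.0, 1.0)
print(f"static object: {static:.3f}, moving object: {moving:.3f}")
```

For the static rectangle the ratio is (numerically) zero, whereas for the rectangle that crosses half of the workspace almost the whole box is dead space.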




Fig. 2. The MBR of a moving object occupies a large portion of the data space 

The approach followed by the RT-tree [27] only partially solves that problem. 
Time information is incorporated, by means of time intervals, inside the 
(two-dimensional) R-tree structure. Each node, either leaf or non-leaf, 
contains entries of the form (S, T, P), where S is the spatial information (MBR), T is 
the temporal information (interval), and P is a pointer to a subtree or the detailed 
description of the object. Let $T = (t_i, t_j)$, $i < j$, $t_j$ be the current timestamp and 
$t_{j+1}$ be the consecutive one. If an object does not change its spatial location from 
$t_j$ to $t_{j+1}$, then its spatial information S remains the same, whereas the temporal 
information T is updated to $T'$ by increasing the interval upper bound, i.e., 
$T' = (t_i, t_{j+1})$. However, as soon as an object changes its spatial location, a new 
entry with temporal information $T = (t_{j+1}, t_{j+1})$ is created and inserted into the 
RT-tree. This insertion strategy makes the structure efficient mostly for databases of 
low mobility; if the number of objects that change is large, then many entries are 
created and the RT-tree grows considerably. An additional criticism is that R-tree node 
construction depends on the spatial information S, while T only plays a complementary 
role. Hence the RT-tree is not able to support temporal queries (e.g., “find all objects 
that exist in the database within a given time interval”). 
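The update rule described above can be sketched as follows. This is only an illustration under simplifying assumptions: entries are kept in a flat list instead of tree nodes, the pointer P is replaced by an object id, and the names `Entry` and `record_state` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    mbr: tuple       # S: spatial information (xl, yl, xu, yu)
    interval: tuple  # T: temporal information (t_start, t_end)
    oid: int         # stands in for P, the pointer to the object


def record_state(entries, oid, mbr, t):
    """Register that object `oid` has MBR `mbr` at timestamp `t`."""
    latest = None
    for e in entries:
        if e.oid == oid and (latest is None or e.interval[1] > latest.interval[1]):
            latest = e
    if latest is not None and latest.mbr == mbr:
        # location unchanged: only extend the interval upper bound
        latest.interval = (latest.interval[0], t)
    else:
        # location changed (or first appearance): new entry with T = (t, t)
        entries.append(Entry(mbr, (t, t), oid))
    return entries


entries = []
states = [(0, 0, 1, 1), (0, 0, 1, 1), (2, 2, 3, 3), (2, 2, 3, 3)]
for t, mbr in enumerate(states):
    record_state(entries, oid=7, mbr=mbr, t=t)
print([(e.mbr, e.interval) for e in entries])
```

An object that moves at every timestamp produces one entry per timestamp, which is the growth problem noted in the text.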

On the other hand, MR-trees and HR-trees are influenced by the work on 
overlapping B-trees [12]. Both methods support the following approach: different 
index instances are created for different transaction timestamps. However, in order to 
save disk space, common paths are maintained only once, since they are shared 
among the structures. The collection of structures can be viewed as an acyclic graph, 
rather than a collection of independent tree structures. The concept of overlapping tree 
structures is simple to understand and implement. Moreover, when the objects that 
have changed their location in space are relatively few, then this approach is very 
space efficient. However, if the number of moving objects from one time instant to 
another is large, this approach degenerates to independent tree structures, since no 
common paths are likely to be found. Figure 3 illustrates an example of overlapping 
trees for two different time instants $t_0$ and $t_1$. The dotted lines represent links to 
common paths / subpaths. 
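The sharing of common paths can be illustrated with a minimal path-copying sketch. It is not the MR-tree or HR-tree algorithm itself: the `Node` class, the fixed child-index path, and the helper names are assumptions made only to show how two versions of a tree overlap.

```python
class Node:
    def __init__(self, children=None, items=None):
        self.children = children or []
        self.items = items or []


def update_leaf(root, path, new_items):
    """Return the root of a new version in which the leaf reached through the
    child indexes in `path` holds `new_items`. Only the nodes on that path are
    copied; every other subtree is shared with the old version."""
    if not path:
        return Node(items=list(new_items))
    children = list(root.children)
    children[path[0]] = update_leaf(children[path[0]], path[1:], new_items)
    return Node(children=children)


def count_nodes(*roots):
    """Number of distinct nodes reachable from the given version roots."""
    seen = set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for child in node.children:
            visit(child)

    for root in roots:
        visit(root)
    return len(seen)


leaves = [Node(items=[name]) for name in "abcd"]
root_t0 = Node(children=[Node(children=leaves[:2]), Node(children=leaves[2:])])
root_t1 = update_leaf(root_t0, [1, 0], ["c'"])
print(count_nodes(root_t0), count_nodes(root_t0, root_t1))
```

In this small example one changed leaf costs three new nodes instead of a full copy of seven; if every leaf changed, nothing would be shared and the two versions would be independent trees.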




Fig. 3. Overlapping trees for two different time instants $t_0$ and $t_1$ 

Among the aforementioned proposals, the 3D R-tree has been implemented and 
experimentally tested [23] using synthetic (uniform) datasets. The retrieval cost for 
several pure temporal, pure spatial and spatiotemporal operators was measured and 
appropriate guidelines were extracted. Recently, Nascimento et al. [14] have 
compared the HR-tree with the 3D R-tree and another structure, called 2-1-3 R-tree, 
using two R-trees and a rationale similar to the 2R approach presented in [10]. The 
basic conclusion is that the HR-tree is far more efficient in terms of query processing 
for time point queries while that is not true for time interval queries. Also, the HR-tree 
usually results in a rather large structure. 



3. A Set of Operations and Parameters 

Theodoridis et al. [22] have discussed a list of specifications to be considered when 
designing and evaluating efficient STAMs with respect to: (i) data types and datasets 
supported, (ii) issues on index construction, and (iii) issues on query processing. 
While the second and third ones mainly address the internal structure of a method and 
hence should be considered by STAM designers, the first group of specifications 
strongly affects the design of an efficient benchmarking environment since it focuses on 
database characteristics for evaluation purposes. In particular, the specifications that 
are addressed in [22] with respect to type (i) are the following: 

- Spec 1: on the data type(s) supported. Appropriate STAMs could support either 
point or non-point spatial objects. In some cases, point objects could be considered 
as special cases of non-point objects but this depends on the underlying modeling. 

- Spec 2: on the time dimension(s) supported. A second classification concerns the 
time dimension(s) supported, i.e., valid and/or transaction time. Since at least one 
time dimension should be supported, spatiotemporal databases are classified into 
valid-time, transaction-time, and bitemporal ones. 

- Spec 3: on the dataset mobility. Three cases are addressed, with respect to the 
motion of objects and the cardinality of the dataset through time, namely evolving 
(i.e., moving objects of a fixed cardinality through time), growing (i.e., static 
objects of varying cardinality through time), and full-dynamic (i.e., moving objects 
of varying cardinality through time) databases. 

- Spec 4: on the timestamp features. Whether future instances could refer to past 
timestamps or not leads to a distinction between chronological and dynamic 
databases, i.e., collections of objects’ instances (o_id, $s_i$, $t_i$) that either do or do 
not have to obey the so-called rule of consecutive timestamps: $t_{i+1} > t_i$. 

In the rest of the paper we study the case of temporally degenerate databases that 
obey the rule of consecutive timestamps, i.e., for each object in the database, the 
following inequality holds between the timestamp $t_i$ of the current instance and the 
timestamp $t_{i+1}$ of the next instance to be inserted into the database: $t_{i+1} > t_i$. 
The term degenerate refers to the characteristic that the valid time of object instances 
is identical to their transaction time. That is, an object is valid as long as it exists in 
the database. The problem that arises when no such constraint exists is clarified 
through the following example: Consider that two instances (o_id, $s_i$, $t_i$) and 
(o_id, $s_j$, $t_j$) of an object o have been inserted into the database (without loss of 
generality, we assume that $t_i < t_j$) and no instance (o_id, $s_k$, $t_k$) exists, such that 
$t_i < t_k < t_j$. Hence $[t_i, t_j)$ is the valid (and transaction) time of instance i. Let us 
now assume that a new instance (o_id, $s_k$, $t_k$) is inserted into the database, such that 
$t_i < t_k < t_j$. Due to that action, (a) the valid time of instance i has to be changed 
from $[t_i, t_j)$ to $[t_i, t_k)$ and (b) the validity interval of the new instance k has to be 
set to $[t_k, t_j)$. No straightforward support for those operations 
exists in current STAMs and, therefore, we currently postpone the study of that case. 
Note however, that this assumption does not hold in the area of bitemporal databases 
[17]. Indeed in bitemporal access structures the rule is that, by definition, only 
transaction time is monotonically increasing. However, as already mentioned, adding 
spatial features to bitemporal data is still an open area for research. 
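The effect of an insertion that violates the rule of consecutive timestamps can be made concrete with a short sketch. The helper names below are hypothetical; the code only tracks the timestamps of a single object and reports which already stored validity intervals would have to be rewritten.

```python
def closed_intervals(timestamps):
    """Validity intervals [t_i, t_next) between consecutive instances."""
    ts = sorted(timestamps)
    return [(ts[k], ts[k + 1]) for k in range(len(ts) - 1)]


def insert_timestamp(timestamps, t_new):
    """Return (obeys_rule, rewritten_intervals, intervals_after_insertion)."""
    obeys_rule = all(t_new > t for t in timestamps)
    before = closed_intervals(timestamps)
    after = closed_intervals(timestamps + [t_new])
    rewritten = [iv for iv in before if iv not in after]
    return obeys_rule, rewritten, after


# chronological insertion: nothing already stored has to change
print(insert_timestamp([0.2, 0.6], 0.8))
# insertion between two stored instances: [0.2, 0.6) must be split
print(insert_timestamp([0.2, 0.6], 0.4))
```

In the second call the stored interval [0.2, 0.6) is replaced by [0.2, 0.4) and [0.4, 0.6), which corresponds to actions (a) and (b) of the example.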



3.1. User Requirements 

Three (Spec 1 to Spec 3) out of the above four specifications are orthogonal to each 
other. On the other hand, only the chronological case of Spec 4 is supported in this 
study, as declared earlier, and, as a result of that, we currently treat transaction- and 
valid-time under a uniform platform. Hence, we distinguish among 12 different 
database families (e.g., a point plus transaction-time plus evolving plus chronological 
database) according to the following options: 

- Spec 1: point vs. region database, 

- Spec 2: transaction- (or valid-) time vs. bitemporal database, 

- Spec 3: evolving vs. growing vs. full-dynamic database, 

- Spec 4: chronological database. 

Note: the rule of consecutive timestamps is applicable to valid-time only, since 
transaction-time always obeys that rule. 
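The count of 12 families follows from the product of the options per specification (2 x 2 x 3 x 1). A minimal enumeration, with the option labels chosen here only for illustration:

```python
from itertools import product

SPEC1 = ["point", "region"]
SPEC2 = ["transaction- (or valid-) time", "bitemporal"]
SPEC3 = ["evolving", "growing", "full-dynamic"]
SPEC4 = ["chronological"]

# every combination of one option per specification is a database family
families = list(product(SPEC1, SPEC2, SPEC3, SPEC4))

for family in families:
    print(" + ".join(family))
print(len(families), "families")
```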

In order for the user of a benchmarking environment to generate a synthetic 
dataset, he/she should be able to (a) select one among the above database options and, 
then (b) tune the cardinality of the dataset and an appropriate set of parameters and 
distributions. 

A fundamental issue on generating synthetic spatiotemporal datasets is the 
definition of a complete set of parameters that control the evolution of spatial objects. 
Towards this goal, we first address the following three operations: 

- duration of an object instance, which involves change of timestamps between 
consecutive instances, 

- shift of an object, which involves change of spatial location (in terms of center 
point shift), and 

- resizing of an object, which involves change of an object’s size (only applicable to 
non-point objects). 

In a more general case, the latter one could be regarded as reshaping of an object, 
as not only size but also shape could change. However, as the MBR is the most 
common approximation used by indices, we only consider that case, and thus shape 
changes are not an issue. 

A description of each operation follows. In particular, the goal to be reached is the 
calculation of the consecutive instances (o_id, $s_i$, $t_i$) of an object o (recall the 
definition in Section 1) starting from an initial instance (o_id, $s_0$, $t_0$). We also assume 
that the spatial workspace of interest is the unit square $[0,1)^2$ and time varies from 0 to 
1 (i.e., the unit interval). For illustration reasons, in Figure 4 we visualize four 
instances of a time-evolving two-dimensional region object o and the corresponding 
projections on the spatial plane and the temporal axis, respectively. 



3.2. Parameters Involved 

The shift, the duration, and the resizing of an object’s instance are represented by the 
functions: 

- duration (o_id, interval, curr_tstamp, new_tstamp) 

- shift (o_id, Δc[], curr_sstamp_c, new_sstamp_c) 

- resizing (o_id, Δext[], curr_sstamp_ext[], new_sstamp_ext[]) 

which calculate the new timestamp and the new spacestamp’s center and extent, 
called new_tstamp (a numeric value), new_sstamp_c (a 2-dimensional point), and 
new_sstamp_ext[] (an array of 2 intervals), respectively, of an object identified by 
its o_id, as the sums of the respective current values and the respective parameters, 
namely interval, Δc[], and Δext[]. 

Fig. 4. Consecutive instances of a time-evolving object and the corresponding projections 
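A minimal sketch of the three functions follows. It departs from the signatures above in ways that are assumptions of this example: the o_id argument is dropped, the new value is returned instead of being passed as an output argument, the extent is modeled as a (width, height) pair, and negative extents are clamped to zero.

```python
def duration(interval, curr_tstamp):
    """New timestamp: current timestamp plus the interval parameter."""
    return curr_tstamp + interval


def shift(delta_c, curr_sstamp_c):
    """New center: current center plus the per-axis shift."""
    return tuple(c + d for c, d in zip(curr_sstamp_c, delta_c))


def resizing(delta_ext, curr_sstamp_ext):
    """New extent: current extent plus the per-axis change.
    Clamping at zero is an assumption of this sketch."""
    return tuple(max(0.0, e + d) for e, d in zip(curr_sstamp_ext, delta_ext))


t_new = duration(0.1, 0.25)
c_new = shift((0.1, -0.2), (0.5, 0.5))
ext_new = resizing((-0.05, 0.02), (0.03, 0.03))
print(t_new, c_new, ext_new)
```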

In summary, Table 1 lists the parameters of interest and their corresponding 
domains. All parameters should follow a (user-defined) distribution, such as the ones 
we discuss in the following subsection. 



Table 1: Parameters for generating time-evolving objects 

Parameter    Type                    Domain 
interval     number                  (0 .. 1) 
Δc[]         2-dimensional vector    (-1 .. 1)^2 
Δext[]       2-dimensional vector    (-1 .. 1)^2 



3.3. Distributions 

A benchmarking environment should support a wide set of well-established initial 
data distributions. Figure 5 illustrates three popular two-dimensional initial 
distributions, namely uniform, gaussian, and skewed. In addition to the initial spatial 
distributions, there exist several other parameters requiring some kind of statistical 
distribution, especially those mentioned above (interval, Δc[], and Δext[]). 

Fig. 5. Basic statistical distributions in two-dimensional space: (a) uniform, 
(b) gaussian, (c) skewed 

Through careful use of possibly different distributions for the above parameters 
one may simulate several interesting scenarios. For instance, using a random 
distribution for Δc[i] as well as for interval, all objects would move equally fast 
(or slow) and uniformly on the map, whereas using a skewed distribution for 
interval one would obtain a relatively large number of slow objects moving 
randomly, and so on. Also, by properly adjusting the distributions for each Δc[i], 
one may control the direction of the objects’ motion. For instance, by setting Δc[i] = 
Uniform(0,1) ∀i and using the ‘adjustment’ approach (see subsection 4.1), one would 
obtain a scenario where the set of objects eventually converges to the upper-right 
corner of the unit workspace, irrespective of the initial distributions. 
Similarly, if one wants the objects to move towards some specific direction (e.g., East), 
he/she can adjust Δc[] and put lower and upper bounds on the center’s generated 
value, as will be discussed in detail in the following section. 

Among the supported distributions, which are illustrated in Figure 5, the uniform 
distribution only requires minimum and maximum values while the other ones require 
extra parameters to be tuned by the user. In particular, the gaussian distribution needs 
mean and variance parameters as input and the skewed distribution needs a parameter 
to be declared, which controls the ‘skewedness’ of the distribution. 

In the following section, we adopt the issues discussed earlier in order to present an 
algorithm that generates synthetic spatiotemporal datasets for benchmarking purposes. 



4. The GSTD Algorithm 



We propose an algorithm, called Generate_Spatio_Temporal_Data (GSTD), for 
generating time-evolving (i.e., moving) point or rectangular objects. For each object 
o_id, GSTD generates tuples of the format (o_id, t, $p_l$, $p_u$), where t is the instance’s 
timestamp and $p_l$ ($p_u$) is the lower (upper) coordinate point of the instance’s 
spacestamp. The GSTD algorithm is listed in the Appendix. 



4.1. Description of the Algorithm 

GSTD gets several user-defined parameters as input: 

- N and D correspond to the initial cardinality and density (i.e., the ratio of the sum of 
the areas of data rectangles over the workspace area) of the dataset, 

- starting_id corresponds to the initial identification number of every object in 
the dataset, 

- numsnapshots corresponds to the time resolution of the workspace, 

- min_t and max_t correspond to the domain of the interval parameter, 

- min_c[] and max_c[] correspond to the domain of the Δc[] parameter, 

- min_ext[] and max_ext[] correspond to the domain of the Δext[] parameter, 

and generates several tuples for each object, according to the following procedure: 
“each object is initially active and, for each one, new instances are generated as long 
as timestamp t < 1; when all objects become inactive, the algorithm ends”. 

During the initialization phase (lines 01-04), all objects’ instances are initialized, 
such that their center points are randomly distributed in the workspace, based on the 
distr_init distribution, and their extensions are either set to zero (in case of point 
datasets) or calculated according to the extent(N, D) routine with respect to the input N 
and D parameters (in case of non-point datasets). 
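The extent(N, D) routine is not spelled out in the text. One simple reading, assumed here, is that all N objects start as equal squares in the unit workspace, so that N k^2 = D and hence k = sqrt(D / N):

```python
import math


def extent(n, d):
    """Side length k of each of n equal squares whose areas sum to density d
    in the unit workspace (assumption: identical square objects)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if d < 0:
        raise ValueError("density must be non-negative")
    return math.sqrt(d / n)


k = extent(1000, 0.5)
print(f"k = {k:.5f}, total area = {1000 * k * k:.3f}")
```

With d = 0 the routine returns zero extents, which matches the point-dataset case.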

During the main loop phase (lines 06-27), each new instance of an object is 
generated as a function of the existing one and the three parameters (interval, 
Δc[], and Δext[]). Then, invalid instances (i.e., those with coordinates located 
outside the predefined workspace) can be manipulated in three alternative ways as 
described below. In order for a new instance to be generated, the interval, Δc[], 
and Δext[] values are calculated by calling an RNG(distr, min, max) routine, 
i.e., a random number generator that generates random numbers between min and 
max following a predefined distr, which is a statistical distribution, such as the ones 
discussed in subsection 3.3. 
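A possible shape for the RNG(distr, min, max) routine is sketched below. The text does not define the skewed distribution or how out-of-range gaussian values are treated, so two assumptions are made: gaussian values are redrawn until they fall in [min, max], and "skewed" raises a uniform value to the power a + 1, which concentrates values near min. The tuple encoding of `distr` is also hypothetical.

```python
import random


def rng(distr, lo, hi, rand=random):
    """distr is ("uniform",), ("gaussian", mean, variance) or ("skewed", a).
    `rand` may be the random module or a random.Random instance."""
    if lo == hi:
        return lo
    kind = distr[0]
    while True:
        if kind == "uniform":
            value = rand.uniform(lo, hi)
        elif kind == "gaussian":
            _, mean, variance = distr
            value = rand.gauss(mean, variance ** 0.5)
        elif kind == "skewed":
            _, a = distr
            value = lo + (hi - lo) * rand.random() ** (a + 1)
        else:
            raise ValueError(f"unknown distribution: {kind}")
        if lo <= value <= hi:  # redraw values outside the domain
            return value


source = random.Random(42)
gauss_sample = [rng(("gaussian", 0.5, 0.1), 0.0, 1.0, source) for _ in range(200)]
skew_sample = [rng(("skewed", 1), 0.0, 1.0, source) for _ in range(500)]
print(min(gauss_sample), max(gauss_sample), sum(skew_sample) / len(skew_sample))
```

Redrawing is only practical when the gaussian mean lies inside or close to the domain; otherwise the loop may run for a long time.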

Then, the output function checks whether the current instance of an object has a 
timestamp value greater than or equal to the value in next_snapshot. If so, the 
coordinates of the instance (given by the old_instance variable) before the current 
instance are printed, using the appropriate timestamp (which depends on the 
next_snapshot variable). In addition, the value of the next_snapshot variable is 
properly adjusted. Otherwise, the current instance is not output. 

Obviously, it is possible that a coordinate may fall outside the workspace; GSTD 
manipulates invalid instances according to one among three alternative approaches: 

- the ‘radar’ approach, where coordinates remain unchanged, although falling 
beyond the workspace, 

- the ‘adjustment’ approach, where coordinates are adjusted (according to linear 
interpolation) to fit the workspace, and 

- the ‘toroid’ approach, where the workspace is assumed to be toroidal, so that once 
an object traverses one edge of the workspace, it enters back in the ‘opposite’ edge. 

Note on the extent(N, D) routine: an appropriate k = extent(N, D) value is set to 
achieve an initial density D of the dataset with respect to the initial cardinality N. 

In the first case, the output instance is appropriately flagged to denote its invalidity 
but the next generated instance is based on that. On the other hand, in the other two 
cases, it is the modified instance that is stored in the resulting data file and used for 
the generation of the next one. Notice that in the ‘radar’ approach, the number of 
objects present at each time instance may vary. 
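The three alternatives can be sketched for an object center as follows. Two simplifications are assumed: the workspace is treated as the closed unit square, and the ‘adjustment’ approach is approximated by clamping each coordinate to the nearest boundary, since the interpolation details are not given in the text. The function name is hypothetical.

```python
def handle_invalid(center, approach):
    """Return (center_to_store, valid_flag) for a 2D center point."""
    inside = all(0.0 <= c <= 1.0 for c in center)
    if inside or approach == "radar":
        # 'radar': coordinates stay as they are, the instance is only flagged
        return tuple(center), inside
    if approach == "adjustment":
        # assumption: clamp each coordinate to the workspace boundary
        return tuple(min(max(c, 0.0), 1.0) for c in center), True
    if approach == "toroid":
        # wrap around: leaving through one edge re-enters at the opposite one
        return tuple(c % 1.0 for c in center), True
    raise ValueError(f"unknown approach: {approach}")


for approach in ("radar", "adjustment", "toroid"):
    print(approach, handle_invalid((1.3, 0.5), approach))
```

Under ‘toroid’, a center at x = 1.1 is stored at about 0.1; a subsequent shift of -0.3 gives about -0.2, which wraps to about 0.8, the same location that ‘radar’ reaches from the unmodified 1.1.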

The three alternative approaches are illustrated in Figure 6 for the example of 
Figure 4. For simplicity, only the centers are illustrated; black (grey) locations 
represent valid (invalid) instances. In the example of Figure 6a, the ‘radar’ fails to 
detect $s_3$, hence it is not stored but the next location is based on that. Unlike ‘radar’, 
the other two approaches calculate a valid instance $s_3'$ to be stored in the data file 
which, in turn, is used by GSTD for the generation of $s_4$. It is interesting to watch the 
behavior of $s_4$ in Figure 6c, where the calculated location finally stored ($s_4'$) is 
actually identical to that in Figure 6a, as the effect of two consecutive calculations for 
$s_3'$ and $s_4'$. 

Fig. 6. GSTD manipulation of invalid instances 



4.2. Examples of Generated Datasets 

As mentioned earlier, real world examples of (point or region) spatiotemporal datasets 
include trajectories of humans, animals, or vehicles, as, for instance, detected by a 
global positioning system (GPS), digital simulations of flights or battles, weather 
forecast and monitoring of fire or hurricane fronts. For example, detecting vehicle 
motion by GPS and storing the whole trajectory in a database is a typical every day 
life example. However, different motion scenarios correspond to different datasets, 
on which an efficient structure should be evaluated. Random versus biased direction 
and fast versus slow motion are some of the parameters that result in completely 
different applications. 

In order to simulate some of those scenarios, in this subsection we present six 
example datasets consisting of point or rectangle objects generated by GSTD. For all 
files the following parameters were set: N = 1000, D = 0 or 0.5 (for points or 
rectangles, respectively), numsnapshots = 100. Illustrated snapshots correspond to t 
= 0, 0.25, 0.50, 0.75, and 1. Figure 7 presents the non-fixed input parameters and the 
generated snapshots for each file. Scenarios 1 and 2 follow the ‘toroid’ and ‘radar’ 
approach, respectively, to manipulate invalid instances, while scenarios 3 through 6 
follow the ‘adjustment’ approach. 

Scenarios 1 and 2 illustrate points with initial gaussian distribution moving towards 
East and NorthEast, respectively. In the former case, where the toroidal world model 
was used, when the points traverse the right edge, they enter back in the left side of 
the map. Notice that, to force the points to move to the East, Δc[y] = 0 and Δc[x] > 
0. In the latter case, where the ‘radar’ approach is simulated, the points move towards 
NorthEast and some of them move beyond the upper-right corner of the map (some 
quite early due to their speed). Notice that since Δc[] > 0 
always, those points will never reappear in the map. 

Scenario 3 illustrates the initially skewed distribution of points and the movement 
towards NorthEast. As the ‘adjustment’ approach was used, the points concentrate 
around the upper-right corner. Scenario 4 includes rectangles initially located around 
the middle point of the workspace, which are moving and resizing randomly. The 
randomness of shift and resizing is guaranteed by the uniform distribution 
U(min, max) used for Δc[] and Δext[], where |min| = |max| > 0. 

Finally, scenarios 5 and 6 exploit the speed of a moving dataset as a function of the 
GSTD input parameters. By increasing (in absolute values) the min and max values of 
Δc[], a user can achieve ‘faster’ objects, while the same behavior could be achieved 
by decreasing the max_t value that affects interval. Thus, the speed of the dataset 
is considered to be meta-information, since it can be derived from the knowledge of 
the primitive parameters. Similarly, the direction of the dataset can be controlled, as 
presented in scenarios 1 through 3. 

Alternatively, if the user’s application requires the combination of two (or 
more) scenarios, as for instance, a population of MBRs with only a small percentage 
of them moving towards some direction and the rest being static, two individual 
scenarios can be generated according to the above by properly setting the two 
starting_id input parameters and then merged, which is a straightforward task. 
In short, by properly adjusting the parameters of Table 1, users can obtain a wide 
spectrum of scenarios fitting their needs. 
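Merging two generated scenarios could look like the sketch below. It assumes that each dataset is a list of tuples whose first two fields are the object id and the timestamp, and that the two runs used disjoint id ranges (via their starting_id values); the function name is hypothetical.

```python
def merge_scenarios(*datasets):
    """Merge datasets of tuples (o_id, t, ...) into one list ordered by
    timestamp, refusing datasets whose object ids overlap."""
    seen_ids = set()
    for dataset in datasets:
        ids = {row[0] for row in dataset}
        if seen_ids & ids:
            raise ValueError("object ids overlap; use distinct starting_id values")
        seen_ids |= ids
    rows = [row for dataset in datasets for row in dataset]
    return sorted(rows, key=lambda row: (row[1], row[0]))


moving = [(0, 0.1, (0.2, 0.2)), (0, 0.3, (0.4, 0.2))]      # starting_id = 0
static = [(100, 0.1, (0.7, 0.7)), (101, 0.2, (0.1, 0.9))]  # starting_id = 100
merged = merge_scenarios(moving, static)
print(merged)
```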



5. Discussion and Related Work 

An alternative straightforward algorithm for generating N time-evolving objects 
would include the calculation of the spacestamp of each object at each snapshot, thus 
leading to an output consisting of T = N · numsnapshots tuples. Our approach 
outperforms that since it outputs a limited number T’ of tuples (T’ ≪ T), i.e., the 
necessary ones in order to reproduce the dataset motion. 
Scenario 1: points moving from center to East (‘toroid’ approach) 
    distr_init = G(0.5, 0.1), interval = G(0, 0.5), 
    Δc[x] = U(0, 0.3), Δc[y] = U(0, 0), Δext[x] = Δext[y] = U(0, 0) 

Scenario 2: points moving from center to NorthEast (‘radar’ approach) 
    distr_init = G(0.5, 0.1), interval = G(0, 0.5), 
    Δc[x] = Δc[y] = U(0, 0.4), Δext[x] = Δext[y] = U(0, 0) 

Scenario 3: points moving from SouthWest to NorthEast 
    distr_init = S(1), interval = G(0, 0.2), 
    Δc[x] = Δc[y] = U(0, 0.3), Δext[x] = Δext[y] = U(0, 0) 

Scenario 4: rectangles moving (and resizing) randomly 
    distr_init = G(0.5, 0.1), interval = G(0, 0.5), 
    Δc[x] = Δc[y] = U(-0.2, 0.2), Δext[x] = Δext[y] = U(-0.01, 0.01) 

Scenario 5: points moving randomly (low speed) 
    distr_init = G(0.5, 0.1), interval = G(0, 0.5), 
    Δc[x] = Δc[y] = U(-0.2, 0.2), Δext[x] = Δext[y] = U(0, 0) 

Scenario 6: points moving randomly (high speed) 
    distr_init = G(0.5, 0.1), interval = G(0, 0.5), 
    Δc[x] = Δc[y] = U(-0.4, 0.4), Δext[x] = Δext[y] = U(0, 0) 

Fig. 7. Example files generated by GSTD 

However, a fundamental question arises: based on the knowledge of two instances 
(o_id, $s_i$, $t_i$) and (o_id, $s_{i+1}$, $t_{i+1}$) that correspond to consecutive timestamps, 
which is the location of an object at a time $t$, such that $t_i < t < t_{i+1}$? As an 
example, recall the instances of the object o illustrated in Figure 4. The status of its 
spacestamp between, e.g., $t_i$ and $t_{i+1}$ is a fuzzy issue. Among others, two 
alternatives may be followed: 

- projection: the spacestamp is considered to be static and equal to the one at time $t_i$, 

- linear interpolation: the spacestamp is considered to be moving with respect to a 
start position (at time $t_i$) and an end position (at time $t_{i+1}$). 

Both alternatives find applications in the real world; cadastral systems, on the one 
hand, versus navigational systems, on the other hand, are popular examples. Figure 8 
illustrates the two alternative scenarios for the example of Figure 4. 
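The two alternatives can be compared on a point object with the following sketch. An instance is represented here as a pair (timestamp, (x, y)); this representation and the function name are assumptions of the example.

```python
def location_at(t, inst_a, inst_b, mode):
    """Location at time t, with t_a <= t < t_b, given two consecutive
    instances inst_a = (t_a, p_a) and inst_b = (t_b, p_b)."""
    (t_a, p_a), (t_b, p_b) = inst_a, inst_b
    if not t_a <= t < t_b:
        raise ValueError("t must lie between the two timestamps")
    if mode == "projection":
        return p_a  # static until the next instance appears
    if mode == "linear":
        f = (t - t_a) / (t_b - t_a)  # fraction of the interval elapsed
        return tuple(a + f * (b - a) for a, b in zip(p_a, p_b))
    raise ValueError(f"unknown mode: {mode}")


first, second = (0.0, (0.0, 0.0)), (1.0, (1.0, 2.0))
print(location_at(0.25, first, second, "projection"))
print(location_at(0.25, first, second, "linear"))
```

Both modes need only the current and the next instance of the object, which is exactly the information a generated dataset provides.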



Fig. 8. Alternative scenarios for the location of an object between two timestamps: 
(a) projection, (b) linear interpolation 

In any case, detecting the status of object o at a time instance during $(t_i, t_{i+1})$ is an 
open issue (e.g., uncertainty may need to be captured [16]). We argue that the 
proposed GSTD algorithm is independent of that issue. Actually, it generates a series 
of instances regardless of such an issue. Unlike the data generator, it is a visualization 
tool or a STAM construction algorithm that needs to support a specific scenario. 
Since in this study we are interested in spatiotemporal databases that follow the rule 
of consecutive timestamps, the knowledge of both the current and the new instances 
of an object, as supported by GSTD (see Appendix, line 13), is sufficient to deal 
with any alternative. 

Note on linear interpolation: it assumes that a linear function represents boundary 
points’ motion, i.e., intermediate locations are linear to the start- and end-points. 
Higher-order polynomials are hardly modeled. 

The need for independent platforms for benchmarking purposes or, in general, 
experiment management has already been addressed in the past [21, 8]. Such a need 
arises when a researcher aims to make a fair performance study or experimentation 
without the dilemma of building his/her own datasets for this purpose. Although 
extensive related work is found in traditional database benchmarks and data generators 
(e.g., [2, 5]), in the field of spatial databases it is very limited [20, 6, 15]. Moreover, 
when motion is introduced to support spatiotemporal databases, to our knowledge, no 
related work exists. 

Relaxing this constraint, the most relevant to our work is the ‘A La Carte’ 
benchmark [6]. It is a WWW-based tool consisting of a rectangle generator that builds 
datasets based on user defined parameters (cardinality, coverage, coordinates’ 
distributions) and an experimentation module that runs experiments on either user 
built or stored sample datasets (including parts of the Sequoia 2000 storage 
benchmark [21]). The module is actually a spatial join performance evaluator that 
supports several spatial join strategies. 



6. Conclusion and Future Work 

STDBMS require appropriate indexing techniques on spatiotemporal data. Although 
conceptually the problem seems to be easy to solve, several issues arise when one 
attempts to adopt a spatial indexing method to organize time-evolving objects by just 
adding an extra dimension for time. Therefore, a limited number of STAMs have been 
proposed in the literature as briefly surveyed in Section 2. 

The effort towards the design and implementation of a benchmarking environment 
in order to provide performance comparison of STAMs leads to the need to collect 
a variety of appropriate synthetic and real spatiotemporal datasets. However, as with 
the design of efficient methods, generating efficient synthetic datasets 
is not a straightforward extension of generating spatial data, such as the ones that have 
been thoroughly used for experimental purposes in the spatial database literature. As a 
first step, several specifications that identify the type of the dataset have to be 
addressed and, as a second step, a set of parameters and corresponding distributions 
have to be tuned by the user. More specifically, we have discussed three operations, 
namely duration of an object instance, shift and resizing of an object (the latter one 
applicable to non-point objects) and derived a set of three parameters, namely 
interval, Δc, and Δext, which control the evolution of a spatial object through time 
in satisfactory terms. 

Based on those parameters, we have designed and implemented the GSTD 
algorithm that generates sets of moving points or rectangles according to users’ 
requirements, thus providing a tool that simulates a variety of possible scenarios. 
Some of those scenarios have been illustrated and discussed in Section 4. GSTD also 
includes alternative methodologies to support invalid instances, i.e., those with 
coordinates falling outside the workspace. 

This study continues the work initiated in [22] towards a full and interactive 
support tool for designing, implementing, and evaluating access methods for the 
purposes of STDBMS. We are currently working on a WWW environment to make 
GSTD available to all researchers through the Internet (mirror sites: 
http://www.dblab.ece.ntua.gr/~theodor/GSTD/ and 
http://www.doc.unicamp.br/~mario/GSTD/). 

We are also investigating some additional functionality on GSTD. For example, 
users may want to specify a movement flow to a specific point p in the workspace. 
Although, given the target p, it is not a complicated task, it is a specific implementation 
of a specific scenario. We currently study the parameterization of such specific 
scenarios by permitting GSTD input parameters to be (user-defined) functions rather 
than fixed values. Such an extension will enhance GSTD flexibility to simulate a 
variety of real applications. 
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Appendix: The GSTD Algorithm 

Generate_Spatio_Temporal_Data algorithm 

Input:  values N, starting_id, numsnapshots, D, min_t, max_t 
        arrays min_c[], max_c[], min_ext[], max_ext[] 
        distributions distr_init, distr_t, distr_c, distr_ext 
Output: instance (id, t, l_l_point, u_r_point), validity_flag 

begin 
01 for each id in range [starting_id .. N+starting_id] do 
     //initialization phase 
02   Set c[] = RNG(distr_init(), 0, 1), ext[] = extent(N, D) 
03   Set t = 0, active = TRUE 
04 end-for 
05 Set step = 1 / numsnapshots 
06 for each id in range [starting_id .. N+starting_id] do 
     //loop phase 
07   Set next_snapshot = step 
08   while active do 
       /* calculate delta-values and new instances */ 
09     Set interval = RNG(distr_t(), min_t, max_t) 
10     Set Δc[] = RNG(distr_c(), min_c[], max_c[]) 
11     Set Δext[] = RNG(distr_ext(), min_ext[], max_ext[]) 
12     Set old_instance = instance 
13     update_instance(instance) 
       /* check instances and output */ 
14     if t > 1 then 
15       active = FALSE 
16       output(old_instance, current[i], next_snapshot) 
17     else //check instance validity and output 
18       Set validity_flag = valid(instance) 
19       if validity_flag = FALSE and approach ≠ 'radar' then 
20         adjust_coords(instance, approach) 
21       end-if 
22       if t > next_snapshot then 
23         output(old_instance, current[i], next_snapshot) 
24       end-if 
25     end-if 
26   end-while 
27 end-for 
end. 
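
The loop phase above can be illustrated with a much simplified Python sketch. It is not the authors' implementation: it handles point objects only (no extents), draws every random value from a uniform distribution, omits the snapshot bookkeeping, and uses a hypothetical clamping adjustment for every approach other than 'radar'. The name `gstd_points` and its parameters are assumptions made for illustration.

```python
import random


def gstd_points(n, min_t, max_t, min_c, max_c, approach="radar", seed=0):
    """Generate (id, t, centre, validity_flag) records for n moving points.

    Simplifying assumptions (not from the paper): 2-D points in the unit
    square, uniform distributions, and min_t > 0 so that time advances.
    """
    rng = random.Random(seed)
    records = []
    for obj_id in range(n):
        # Initialization phase: random starting position at time 0.
        c = [rng.random(), rng.random()]
        t = 0.0
        while True:
            # Loop phase: draw the delta values and update the instance.
            t += rng.uniform(min_t, max_t)
            if t > 1.0:
                break  # the object is no longer active
            c = [c[d] + rng.uniform(min_c[d], max_c[d]) for d in range(2)]
            valid = all(0.0 <= x <= 1.0 for x in c)
            if not valid and approach != "radar":
                # Hypothetical adjustment: clamp back into the workspace.
                c = [min(1.0, max(0.0, x)) for x in c]
            records.append((obj_id, t, tuple(c), valid))
    return records


if __name__ == "__main__":
    for rec in gstd_points(2, 0.1, 0.3, [-0.2, -0.2], [0.2, 0.2])[:5]:
        print(rec)
```

With the 'radar' setting, instances that leave the workspace keep their coordinates and are only flagged as invalid; with any other setting this sketch moves them back inside.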
Abstract. The polygon amalgamation operation computes the boundary of the 
union of a set of polygons. This is an important operation for spatial on-line 
analytical processing and spatial data mining, where polygons representing 
different spatial objects often need to be amalgamated by varying criteria 
when the user wants to aggregate or reclassify these objects. The processing 
cost of this operation can be very high for a large number of polygons. Based 
on the observation that not all polygons to be amalgamated contribute to the 
boundary, we investigate in this paper efficient polygon amalgamation methods 
that exclude those internal polygons without retrieving them from the 
database. Two novel algorithms, adjacency-based and occupancy-based, are 
proposed. While both algorithms can reduce the amalgamation cost 
significantly, the occupancy-based algorithm is particularly attractive 
because: 1) it retrieves a smaller amount of data than the adjacency-based 
algorithm; 2) it is based on a simple extension to a commonly used spatial 
indexing mechanism; and 3) it can handle fuzzy amalgamation. 

Keywords: spatial databases, polygon amalgamation, on-line analytical 
processing (OLAP), spatial indexing. 



1 Introduction 

Following the success and popularity of on-line analytical processing (OLAP) 
and data mining in relational databases and data warehouses, an important 
direction in spatial database research is to develop spatial data warehousing, 
spatial OLAP and spatial data mining mechanisms in order to extract implicit 
knowledge, spatial relationships, and other interesting patterns not explicitly 
stored in spatial databases. Huge amounts of spatial data have been accumulated 
in the last two decades by government agencies and other organizations for 
various purposes such as land information management, asset and facility 
management, resource management, and environment management. With the 
maturity of commercial spatial database management systems (SDBMS), there is 
a trend to migrate spatial data from proprietary file systems to an SDBMS. 
Thanks to various national spatial data initiatives and international 
standardization efforts, it is now both feasible and cost-effective to 
integrate spatial databases from different sources. Dramatic improvements have 
been made in the accessibility of extensive and comprehensive data sets in 
terms of geographical coverage, thematic layers and time. With huge amounts of 
integrated spatial data available, it is an urgent task to develop powerful 
and efficient methods for the analysis and understanding of spatial data and 
to utilize them effectively. 

Fig. 1. An example of polygon amalgamation. 

For efficient OLAP and mining of spatial data, a spatial data warehouse needs 
to be built. The cost of building a spatial data warehouse is intrinsically 
higher than that of building a relational data warehouse. Spatial operations 
are both I/O-intensive (for retrieving large amounts of spatial objects from 
the database) and CPU-intensive (for performing complex spatial operations). 
The design of efficient spatial indexing structures and algorithms for 
processing various spatial operations and queries has been a central theme in 
spatial database research. Satisfactory performance has been achieved for many 
spatial database operations. However, in the process of evolving SDBMS towards 
spatial OLAP and spatial data mining, the performance of spatial data 
processing becomes the bottleneck again, since these new applications analyze 
very large amounts of complex spatial data using costly spatial operations. 

Among many spatial operations, we have found that special attention needs to 
be paid to one particular operation, the polygon amalgamation operation. Given 
a set of source polygons, this operation computes a new polygon (called the 
target polygon) which is the boundary of the union of the source polygons. 
Figure 1(b) shows the target polygon from merging the source polygons shown in 
Figure 1(a). While both intersection and union are basic operations on polygon 
data, the polygon amalgamation problem, unlike the polygon intersection 
problem, has received little attention so far in the context of spatial 
databases. This operation, however, becomes a fundamental operation for 
emerging new applications such as spatial OLAP and spatial data mining. 

Consider a typical scenario in spatial OLAP. A region is partitioned into a 
set of areas (represented as polygons), where each area is described by some 
non-spatial attributes (for example, area name, time, temperature and 
precipitation). Using the spatial data warehouse model proposed in earlier 
work, a spatial data cube can be constructed with a spatial dimension (e.g., 
area) and several non-spatial dimensions (e.g., area name, time, temperature, 
precipitation). The measures used here can be non-spatial such as daily, 
weekly or monthly, or spatial such as combined areas according to a certain 
concept hierarchy. Typical OLAP operations such as drill-down and roll-up can 
be applied to both spatial and non-spatial dimensions. A roll-up operation, 
such as generalizing temperature from degrees as recorded in the database into 
broader categories such as ‘cold’, ‘mild’ and ‘hot’, requires merging areas 
according to their temperature degrees. Target polygons generated from such a 
generalization operation can be used for further operations (e.g., overlay 
with another spatial layer such as soil types). For spatial data mining, the 
polygon amalgamation operation is also important. For example, the user may 
want to group similar or closely related spatial objects into clusters, or to 
classify spatial objects according to certain feature classes (such as highly 
developed vs. poorly developed regions). Such mining will lead to combining 
polygons into large groups for high-level description or inductive inference, 
using the polygon amalgamation operation. 

The above discussion shows that an OLAP operation on a spatial dimension or a 
clustering operation on a group of spatial objects can result in new polygons 
at a high level of abstraction. Because of the high processing cost associated 
with the polygon amalgamation operation, it is desirable to pre-compute target 
polygons and store them in the data cube in order to support fast on-line 
analysis. Obviously, there is a trade-off between the on-line processing cost 
of computing target polygons and the storage cost of materializing them. A 
similar problem in relational OLAP has been investigated by several 
researchers. While materializing every view requires a huge amount of disk 
space, not materializing any view requires a great deal of on-the-fly and 
often redundant computation. Therefore, the cuboids (which are sub-cubes of a 
data cube) in a data cube are typically partially materialized. Even when a 
cuboid is chosen for materialization, it is still unrealistic to pre-compute 
every possible combination of source polygons when there is a large number of 
source polygons, due to the prohibitive amount of storage required for newly 
generated polygons. In other words, some polygon amalgamation tasks have to be 
performed on the fly. Moreover, certain types of multi-dimensional analysis 
require dynamic generation of hierarchies and dynamic computation of polygon 
amalgamation. In the previous example, different users may define different 
temperature ranges for ‘cold’, ‘mild’ and ‘hot’. It might be necessary for 
some spatial data mining algorithms to try different classifications (e.g., 
adding two more categories ‘very cold’ and ‘warm’ in order to find 
relationships between temperature and vegetation distribution). This type of 
analysis also demands dynamic polygon amalgamation. 
Efficient polygon amalgamation is crucial for both building a spatial data 
warehouse and on-line processing. A straightforward method for polygon 
amalgamation is to retrieve all source polygons from the database, and then 
merge them using some computational geometry algorithm such as the one 
described in [22]. Such a simplistic approach can be very time-consuming when 
the number of source polygons is large. We observe that there exist some 
internal polygons which do not contribute to the boundary of the target 
polygon. For example, if it is sufficient to use only the polygons shown in 
Figure 1(c) to compute the target polygon in Figure 1(b), a saving can be made 
by not fetching and processing the other, internal polygons. Savings from such 
an optimization can be significant, as the number of polygons to be processed 
is reduced to be proportional to the perimeter of the target polygon, as 
opposed to its surface area. Obviously, the CPU cost of polygon amalgamation 
can be reduced by processing a smaller number of polygons. Whether the I/O 
cost can also be reduced depends on whether those internal polygons can be 
identified without retrieving them from the database. 

In this paper, we propose two novel methods for identifying internal polygons 
without retrieving them from the database. The first method uses information 
about polygon adjacency, and the other takes advantage of spatial indexing. 
Both algorithms are highly effective in reducing CPU and I/O costs. The 
latter, however, is particularly attractive for several reasons. Compared with 
the adjacency-based algorithm, it takes less time to identify internal 
polygons. More importantly, it is based on a simple extension to a popular 
spatial indexing mechanism that is supported by many SDBMSs. Thus, this 
algorithm can be incorporated easily and efficiently into the SDBMSs 
supporting that type of indexing mechanism. Another advantage arises when 
there are holes in target polygons. For spatial OLAP and spatial data mining 
applications, it is sometimes desirable to ignore holes smaller than a certain 
threshold. These small holes might be insignificant to applications, or caused 
by imperfect data quality (for example, a small area with a high temperature 
surrounded by areas with low temperatures can be ‘noise’). We call it fuzzy 
amalgamation if the holes smaller than a specified threshold are to be ignored 
when merging polygons. Unlike the adjacency-based algorithm, the 
occupancy-based method can handle fuzzy amalgamation without incurring extra 
overhead for removing small holes. 

The remainder of the paper is organized as follows. In Section 2, we give a 
basic amalgamation algorithm that processes all source polygons. The 
adjacency-based and occupancy-based algorithms are discussed in Section 3. A 
performance study of these algorithms is reported in Section 4. We conclude 
our discussion in Section 5. 



2 A Simple Approach 

In this section, we give a simple amalgamation algorithm that processes all source 
polygons. This algorithm provides a reference for evaluating the performance of 
other algorithms. 
Fig. 2. Polygon representation 



2.1 Polygon Representation 

Let $\mathcal{D}$ be the data space, and $P = \{p_i \mid i = 1..n\}$ be a set 
of polygons inside $\mathcal{D}$, where $p_i$ also denotes the identifier of 
polygon $p_i$. Polygon identifiers are unique in $P$. The boundary of a 
polygon may be disconnected. For example, the State of New South Wales (NSW) 
encloses the Australian Capital Territory (ACT), and ACT consists of two 
disconnected areas. We call a connected component of a polygon's boundary a 
ring. Thus, ACT is defined by two rings, and NSW is defined by three rings 
(one for its outer boundary and two for excluding ACT). In this paper we 
assume that a ring does not intersect with itself. We also assume that 
polygons in $P$ do not overlap with each other. We use $t(S)$ to denote the 
target polygon amalgamated from a set $S \subseteq P$ of source polygons. A 
polygon $p$ is a boundary polygon of $S$ if it shares part of its boundary 
with that of $t(S)$. The set of all boundary polygons of $S$ is denoted by 
$\partial S$. The polygons in $(S - \partial S)$ (i.e., in $S$ but not in 
$\partial S$) are internal polygons. 

A point is represented by its coordinates $(x, y)$. A line segment $l$ is 
represented by its start and end points $(l.s, l.e)$. A polygon is represented 
as a sequence of points. For a polygon with $k$ points $v_1 \ldots v_k$, its 
boundary is defined by $k$ connected line segments $(v_1, v_2), \ldots, 
(v_{k-1}, v_k), (v_k, v_1)$. After a polygon is fetched from the database, it 
is unfolded into the form of line segments following, for example, the 
clockwise order (as in Figure 2). In other words, we view a polygon as a 
sequence of line segments in this paper. Thus, the polygons in Figure 2(a) are 
represented as: 

$p_1 = ((a, b), (b, c), (c, d), (d, a))$, 

$p_2 = ((e, f), (f, g), (g, h), (h, e))$, 

$p_3 = ((e, i), (i, j), (j, k), (k, e))$. 

All the line segments in a polygon $p$ are said to be in $S$ if $p \in S$. We 
use $|S|$ and $\|S\|$ to denote the number of polygons and the number of line 
segments in $S$, respectively. 
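
The unfolding of a point sequence into line segments can be sketched in a few lines of Python. This is only an illustration; the function name `unfold` is an assumption, and vertices are given as labels rather than coordinates.

```python
def unfold(points):
    """Turn a closed polygon, given as a list of vertices, into its
    boundary segments: one per consecutive pair plus the closing one."""
    k = len(points)
    return [(points[i], points[(i + 1) % k]) for i in range(k)]


if __name__ == "__main__":
    # The four-vertex polygon with vertices a, b, c, d.
    print(unfold(["a", "b", "c", "d"]))
```

The modulo index closes the ring by connecting the last vertex back to the first.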

Among many possible relationships between two line segments $l$ and $l'$ in 
$\mathcal{D}$, we are interested in two relationships, which are informally 
defined as: 

1. congruent: $l$ and $l'$ are congruent if they are between the same pair of 
points (i.e., $l.s = l'.s$ and $l.e = l'.e$, or $l.s = l'.e$ and 
$l.e = l'.s$). In this case we also say these two line segments are identical. 

2. touching: $l$ touches $l'$ if there exists one and only one common point 
$v$ (termed the joint point) between $l$ and $l'$, such that $v$ is an end 
point of $l$ (i.e., $v = l.s$ or $v = l.e$) and $v$ is not an end point of 
$l'$ (i.e., $v \neq l'.e$ and $v \neq l'.s$). For example, $(b, c)$ touches 
$(e, f)$ at point $c$ in Figure 2(a). 
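
The two relationships can be expressed as small predicates. The sketch below is one possible reading of the informal definitions, assuming segments are pairs of points with integer coordinates so that the collinearity test is exact; the helper names are hypothetical.

```python
def congruent(l1, l2):
    """True if both segments join the same pair of points (either direction)."""
    return l1 == l2 or l1 == (l2[1], l2[0])


def cross(seg, v):
    """Cross product of (seg.e - seg.s) and (v - seg.s); 0 means collinear."""
    (x1, y1), (x2, y2) = seg
    return (x2 - x1) * (v[1] - y1) - (y2 - y1) * (v[0] - x1)


def strictly_inside(v, seg):
    """True if v lies on seg but is neither of its end points."""
    (x1, y1), (x2, y2) = seg
    if cross(seg, v) != 0:
        return False
    dot = (v[0] - x1) * (x2 - x1) + (v[1] - y1) * (y2 - y1)
    return 0 < dot < (x2 - x1) ** 2 + (y2 - y1) ** 2


def touches(l1, l2):
    """Return the joint point if l1 touches l2, otherwise None."""
    s, e = l1
    s_in, e_in = strictly_inside(s, l2), strictly_inside(e, l2)
    if s_in == e_in:
        return None  # no end point of l1 inside l2, or both (overlap)
    joint, other = (s, e) if s_in else (e, s)
    if cross(l2, other) == 0:
        return None  # collinear overlap: more than one common point
    return joint


if __name__ == "__main__":
    # Hypothetical coordinates for the points b, c, e, f.
    b, c, e, f = (3, 4), (3, 2), (2, 2), (4, 2)
    print(touches((b, c), (e, f)))
```

Note that touching is not symmetric: in the usage example the first segment touches the second, but not the other way round.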

The boundary of $t(S)$ consists of some existing line segments in $S$, and 
possibly some new line segments, each of which is defined with at least one 
joint point. The polygon in Figure 2(b) consists of a set of existing line 
segments in $S = \{p_1, p_2, p_3\}$ in Figure 2(a), plus three new line 
segments $(c, f)$, $(h, i)$ and $(k, d)$, where $c$, $h$, and $d$ are joint 
points. For any algorithm to compute a target polygon, it is necessary to find 
at least the joint points that define the target polygon. In order to avoid 
costly operations of splitting lines at joint points, we apply a 
pre-processing step such that whenever a line segment $l$ of polygon $p$ 
touches $l'$ of polygon $p'$ at point $v$, $l'$ in $p'$ is replaced by two new 
line segments $(l'.s, v)$ and $(v, l'.e)$. Note that how lines are split here 
depends only on the data set $P$, regardless of which subset of $P$ is to be 
amalgamated. Thus, this operation can be done at the time of building spatial 
data cubes on $P$. After such a pre-processing step, the polygons in Figure 2 
are represented in the database as 

$p_1 = ((a, b), (b, c), (c, e), (e, d), (d, a))$, 

$p_2 = ((e, c), (c, f), (f, g), (g, h), (h, e))$, 

$p_3 = ((e, h), (h, i), (i, j), (j, k), (k, d), (d, e))$. 
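
As an illustration of the pre-processing step, the following Python sketch splits every segment at each vertex of the data set that lies strictly inside it. This is a simplification of the touching rule rather than a literal transcription of it, and the coordinates given to the points a to k are invented so that they reproduce the topology of the example; integer coordinates keep the tests exact.

```python
def strictly_inside(v, seg):
    """True if v lies on seg but is neither of its end points."""
    (x1, y1), (x2, y2) = seg
    if (x2 - x1) * (v[1] - y1) - (y2 - y1) * (v[0] - x1) != 0:
        return False
    dot = (v[0] - x1) * (x2 - x1) + (v[1] - y1) * (y2 - y1)
    return 0 < dot < (x2 - x1) ** 2 + (y2 - y1) ** 2


def split_at_joints(polygons):
    """polygons: dict mapping an identifier to a list of segments."""
    vertices = {v for segs in polygons.values() for seg in segs for v in seg}
    result = {}
    for pid, segs in polygons.items():
        new_segs = []
        for s, e in segs:
            joints = [v for v in vertices if strictly_inside(v, (s, e))]
            # Order the joint points along the segment, starting from s.
            joints.sort(key=lambda v: (v[0] - s[0]) ** 2 + (v[1] - s[1]) ** 2)
            chain = [s, *joints, e]
            new_segs.extend(zip(chain, chain[1:]))
        result[pid] = new_segs
    return result


# Hypothetical coordinates for the points a..k of the example.
a, b, c, d = (1, 4), (3, 4), (3, 2), (1, 2)
e, f, g, h = (2, 2), (4, 2), (4, 0), (2, 0)
i, j, k = (2, -1), (0, -1), (0, 2)
EXAMPLE = {
    "p1": [(a, b), (b, c), (c, d), (d, a)],
    "p2": [(e, f), (f, g), (g, h), (h, e)],
    "p3": [(e, i), (i, j), (j, k), (k, e)],
}

if __name__ == "__main__":
    for pid, segs in split_at_joints(EXAMPLE).items():
        print(pid, segs)
```

Because the split depends only on the data set and not on the subset being amalgamated, a routine of this kind would run once when the data is loaded.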



2.2 Removing Identical Line Segments 

Under the above assumptions, it is clear that there are no identical line 
segments within a polygon. Further, a line segment $l$ is on the boundary of 
$t(S)$ if and only if it has no identical line segment in $S$. Therefore, a 
straightforward algorithm to amalgamate polygons is to remove all identical 
line segments in $S$. Below is a sketch of such an algorithm: 

Algorithm SIMPLE 

Given a set of source polygons $S$, find $t(S)$. 

1. (Retrieve data) fetch all the polygons in $S$ into a set $L$ of line 
segments, with the middle point of each line segment calculated. 

2. (Remove identical line segments) sort line segments by their middle points 
(by x then y), and remove all line segments whose middle points appear more 
than once in $L$. 

3. (Finish) return the remaining line segments in $L$ as $t(S)$. 
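
The three steps can be sketched in Python as follows. This is a minimal in-memory illustration, not the implementation evaluated in the paper: polygons are plain lists of segments that have already been split at joint points, the coordinates of the example points are invented, and the doubled middle point is used as the sort key so that integer coordinates stay exact.

```python
from itertools import groupby


def doubled_midpoint(seg):
    """Twice the middle point of a segment (avoids fractions)."""
    (x1, y1), (x2, y2) = seg
    return (x1 + x2, y1 + y2)


def simple_amalgamate(polygons):
    """Return the segments whose middle point occurs exactly once."""
    # Step 1: gather all segments of all source polygons.
    L = [seg for segs in polygons for seg in segs]
    # Step 2: sort by middle point (x then y) and drop repeated ones.
    L.sort(key=doubled_midpoint)
    boundary = []
    for _, group in groupby(L, key=doubled_midpoint):
        group = list(group)
        if len(group) == 1:
            boundary.append(group[0])
    # Step 3: the remaining segments form the target polygon.
    return boundary


# Hypothetical coordinates for the pre-processed example polygons.
a, b, c, d = (1, 4), (3, 4), (3, 2), (1, 2)
e, f, g, h = (2, 2), (4, 2), (4, 0), (2, 0)
i, j, k = (2, -1), (0, -1), (0, 2)
p1 = [(a, b), (b, c), (c, e), (e, d), (d, a)]
p2 = [(e, c), (c, f), (f, g), (g, h), (h, e)]
p3 = [(e, h), (h, i), (i, j), (j, k), (k, d), (d, e)]

if __name__ == "__main__":
    print(simple_amalgamate([p1, p2, p3]))
```

The returned segments are in sorted order, not in ring order; chaining them into rings would be a separate step.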

Since we assume that there are no overlapping polygons or self-intersecting 
rings in the source data, in this algorithm we use the middle point to 
represent a line segment for the identification of identical line segments. It 
is simpler and more efficient to process points than line segments. An 
additional advantage of representing a line segment by a point is that it 
becomes possible to apply a hash function to avoid sorting all line segments 
together. That is, space $\mathcal{D}$ can be divided into cells, and a data 
bucket is associated with each cell. A line segment is mapped into the bucket 
whose corresponding cell contains the middle point of the line segment. After 
line segments are mapped into buckets, all identical line segments must be 
inside the same bucket. Thus, it is sufficient to sort the line segments 
bucket by bucket, instead of sorting them all together. This hashing-based 
method is particularly useful when the memory is not large enough for storing 
all line segments of $S$, as it is now possible to apply well-known methods 
developed in relational databases to handle similar problems (such as the 
hybrid join algorithm). 
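
The bucket idea can be illustrated with a short Python sketch. It is an in-memory illustration only: a real system would write buckets to disk, and the regular grid with an integer `cell_size` is an assumption made here to keep the arithmetic exact.

```python
from collections import defaultdict


def doubled_midpoint(seg):
    """Twice the middle point of a segment (avoids fractions)."""
    (x1, y1), (x2, y2) = seg
    return (x1 + x2, y1 + y2)


def bucketed_amalgamate(segments, cell_size):
    """Remove identical segments bucket by bucket instead of globally."""
    buckets = defaultdict(list)
    for seg in segments:
        mx2, my2 = doubled_midpoint(seg)
        # Grid cell containing the middle point of the segment.
        cell = (mx2 // (2 * cell_size), my2 // (2 * cell_size))
        buckets[cell].append(seg)
    boundary = []
    for bucket in buckets.values():
        bucket.sort(key=doubled_midpoint)  # only one bucket at a time
        keys = [doubled_midpoint(seg) for seg in bucket]
        for n, seg in enumerate(bucket):
            twin_before = n > 0 and keys[n - 1] == keys[n]
            twin_after = n + 1 < len(bucket) and keys[n + 1] == keys[n]
            if not (twin_before or twin_after):
                boundary.append(seg)
    return boundary


if __name__ == "__main__":
    left = [((0, 0), (0, 2)), ((0, 2), (2, 2)), ((2, 2), (2, 0)), ((2, 0), (0, 0))]
    right = [((2, 0), (2, 2)), ((2, 2), (4, 2)), ((4, 2), (4, 0)), ((4, 0), (2, 0))]
    print(bucketed_amalgamate(left + right, cell_size=1))
```

Identical segments share a middle point and therefore always fall into the same bucket, which is what makes the per-bucket sort sufficient.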

3 Identifying Boundary Polygons 

In this section we investigate two methods of identifying a subset 
$\bar{S} \subseteq S$ such that $t(\bar{S}) = t(S)$. The performance of these 
algorithms will be discussed in Section 4. 

3.1 Using Adjacency Information 

Two identical line segments must come from two adjacent polygons. Strictly 
speaking, polygons can be adjacent to each other by edge or by point. There is 
no need to consider the latter because our interest here is to identify 
identical line segments. Using the data structures discussed in Section 2, one 
can simply define that two polygons are adjacent to each other if they have at 
least one pair of identical line segments. Moreover, the adjacency table of a 
set $P$ of polygons is defined as a two-column table ADJACENCY($p$, $p'$) 
where $p$, $p'$ are identifiers of polygons in $P$. A tuple $(p, p')$ is in 
table ADJACENCY if and only if polygon $p$ is adjacent to polygon $p'$. The 
adjacency relation is symmetric. For two adjacent polygons $p$ and $p'$, one 
can record the adjacency information redundantly (i.e., recording both 
$(p, p')$ and $(p', p)$). Alternatively, a total order among polygon 
identifiers can be imposed (e.g., the alphabetical order if the identifiers 
are character strings) such that $(p, p')$ is in the adjacency table only if 
$p < p'$. As will be discussed in Section 4, this decision is an 
implementation issue, which has implications for the efficiency of the 
database queries to identify boundary polygons. For simplicity of 
presentation, we assume the redundancy approach hereafter. Below is the 
adjacency table for the four polygons in Figure 3(a), where 
$P = \{p_1, p_2, p_3, p_4\}$: 

$\{(p_1, p_2), (p_1, p_3), (p_1, p_4), (p_2, p_3), (p_2, p_4), (p_3, p_4), 
(p_2, \mathcal{D}), (p_3, \mathcal{D}), (p_4, \mathcal{D})\}$ 

where $(p, \mathcal{D})$ means $p$ has at least one line segment adjacent to 
no $P$ polygons (in this case we say $p$ is adjacent to a dummy polygon, also 
labeled $\mathcal{D}$). For $S \subseteq P$, by definition we have 

$\partial S = \{p \mid p \in S,\ \exists (p, p') \in \text{ADJACENCY},\ p' \notin S\}$ 

That is, a polygon is on the boundary of $t(S)$ if and only if it is inside 
$S$ and it has at least one adjacent polygon which is not in $S$. Using the 
adjacency table, $\partial S$ can be easily identified using SQL queries. 
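
One way such a query might look is sketched below using Python's built-in `sqlite3` module. The table and column names (`adjacency`, `p`, `p_adj`, `src`) and the label `D` for the dummy polygon are assumptions made for this illustration; the rows follow the four-polygon example, stored redundantly.

```python
import sqlite3

DUMMY = "D"  # stands for the dummy polygon outside the data set


def build_example():
    """Create an in-memory adjacency table for the four example polygons."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE adjacency (p TEXT, p_adj TEXT)")
    pairs = [("p1", "p2"), ("p1", "p3"), ("p1", "p4"),
             ("p2", "p3"), ("p2", "p4"), ("p3", "p4")]
    rows = pairs + [(q, p) for p, q in pairs]  # redundant recording
    rows += [("p2", DUMMY), ("p3", DUMMY), ("p4", DUMMY)]
    conn.executemany("INSERT INTO adjacency VALUES (?, ?)", rows)
    return conn


def boundary_polygons(conn, S):
    """Return the polygons of S that have a neighbour outside S."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS src (p TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM src")
    conn.executemany("INSERT INTO src VALUES (?)", [(p,) for p in S])
    rows = conn.execute(
        """
        SELECT DISTINCT a.p
        FROM adjacency AS a
        WHERE a.p IN (SELECT p FROM src)
          AND a.p_adj NOT IN (SELECT p FROM src)
        """
    )
    return {row[0] for row in rows}


if __name__ == "__main__":
    conn = build_example()
    print(sorted(boundary_polygons(conn, ["p1", "p2", "p3", "p4"])))
```

The query touches only the relational adjacency table, so no spatial data is read to find the boundary polygons.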
Fig. 3. An example of shadow ring 



Unfortunately, it is not sufficient to use only $\partial S$ to produce 
$t(S)$. When $\partial S \subset S$ (i.e., a true subset), two rings can be 
produced by removing all identical line segments in $\partial S$. For example, 
when merging the four polygons in Figure 3(a) using 
$\partial S = \{p_2, p_3, p_4\}$, two rings are produced as shown in 
Figure 3(b), where the inner ring is actually not part of $t(S)$. We call such 
an unwanted ring a shadow ring. The reason for producing shadow rings is 
simple: some line segments which would otherwise be found to be identical 
cannot be recognized because their counterpart line segments are from polygons 
not in $\partial S$. In general, as a result of using only a subset of 
polygons, for each ring $r$ which is part of $t(S)$ (called a boundary ring), 
there may exist a corresponding shadow ring $r'$. Note that $r$ can either 
enclose or be enclosed by its shadow ring $r'$ (the latter happens when $r$ 
defines a hole of the target polygon). A shadow ring can be a boundary ring at 
the same time (e.g., when $S = \{p_2, p_3, p_4\}$ in Figure 3). The following 
example illustrates that it is not possible to tell whether a ring is a shadow 
ring by only looking at $\partial S$. While the inner ring in Figure 3(b) is a 
shadow ring when $S = \{p_1, p_2, p_3, p_4\}$, it defines a hole in $t(S)$ 
when $S = \{p_2, p_3, p_4\}$. In order to identify possible shadow rings, we 
need to use a supplementary data set $\partial S^{+} \subseteq S$, where 
$\partial S^{+}$ contains those $S$ polygons that are adjacent to 
$\partial S$ polygons but are not themselves in $\partial S$. That is, 

$\partial S^{+} = \{p \mid p \in (S - \partial S),\ \exists p' \in \partial S,\ (p, p') \in \text{ADJACENCY}\}$ 

We call the polygons in $\partial S$ and $\partial S^{+}$ the boundary 
polygons and the sub-boundary polygons respectively. For a line segment 
$l \in \partial S$, if there exists a line segment $l' \in S$ and $l'$ is 
identical to $l$, $l'$ must be in either $\partial S$ or $\partial S^{+}$. In 
other words, no $\partial S$ line segments can form a shadow ring if 
$\partial S$ polygons are processed together with $\partial S^{+}$ polygons. 
Of course, after removing identical line segments in 
$\partial S \cup \partial S^{+}$, all line segments from $\partial S^{+}$ 
need to be discarded. 

Algorithm ADJACENCY 

Given the adjacency table ADJACENCY for a set $P$ of polygons and 
$S \subseteq P$, find $t(S)$. 

1. (Find $\partial S$) $\partial S = \emptyset$; for each $p \in S$, add $p$ 
to $\partial S$ if $(p, p') \in \text{ADJACENCY}$ and $p' \notin S$. 

2. (Find $\partial S^{+}$) $\partial S^{+} = \emptyset$; for each 
$p \in \partial S$, add $p' \in (S - \partial S)$ to $\partial S^{+}$ if 
$(p, p') \in \text{ADJACENCY}$. 

3. (Retrieve data) retrieve all polygons in 
$\bar{S} = \partial S \cup \partial S^{+}$ into a set $L$ of line segments, 
and mark the line segments from $\partial S^{+}$ as ‘auxiliary’. 

4. (Remove line segments) remove identical line segments in $L$ (as in 
Algorithm SIMPLE). 

5. (Remove $\partial S^{+}$ line segments) remove all the ‘auxiliary’ line 
segments from $L$. 

6. (Finish) return the remaining line segments in $L$ as $t(S)$. 
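
The six steps can be traced with a small in-memory Python sketch. It is an illustration under several assumptions: polygons live in a dictionary instead of a database, the adjacency table is a set of pairs built from shared segments, the label `D` stands for the dummy polygon, and the test data is a hypothetical 3 x 3 grid of unit squares whose centre square is internal.

```python
from collections import Counter, defaultdict

DUMMY = "D"  # dummy polygon representing "no neighbour"


def key(seg):
    """Doubled middle point, used to recognise identical segments."""
    (x1, y1), (x2, y2) = seg
    return (x1 + x2, y1 + y2)


def build_adjacency(polygons):
    """Adjacency pairs (recorded redundantly) derived from shared segments."""
    owners = defaultdict(list)
    for pid, segs in polygons.items():
        for seg in segs:
            owners[key(seg)].append(pid)
    adjacency = set()
    for pids in owners.values():
        if len(pids) == 1:
            adjacency.add((pids[0], DUMMY))
        for a in pids:
            for b in pids:
                if a != b:
                    adjacency.add((a, b))
    return adjacency


def adjacency_amalgamate(polygons, adjacency, S):
    S = set(S)
    dS = {p for p, q in adjacency if p in S and q not in S}        # step 1
    inner = S - dS
    dS_plus = {q for p, q in adjacency if p in dS and q in inner}  # step 2
    L = [(seg, pid in dS_plus)                                     # step 3
         for pid in dS | dS_plus for seg in polygons[pid]]
    counts = Counter(key(seg) for seg, _ in L)                     # step 4
    # Steps 5 and 6: drop auxiliary segments, return what remains.
    return [seg for seg, aux in L if counts[key(seg)] == 1 and not aux]


def square(x, y):
    """Unit square with lower-left corner (x, y), in clockwise order."""
    pts = [(x, y), (x, y + 1), (x + 1, y + 1), (x + 1, y)]
    return [(pts[n], pts[(n + 1) % 4]) for n in range(4)]


GRID = {(x, y): square(x, y) for x in range(3) for y in range(3)}

if __name__ == "__main__":
    adj = build_adjacency(GRID)
    print(len(adjacency_amalgamate(GRID, adj, GRID.keys())))
```

When all nine squares are merged, the centre square is only a sub-boundary polygon: it is fetched to cancel the would-be shadow ring and then discarded. When the centre square is left out of S, the same four segments survive as a genuine hole.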

Algorithm ADJACENCY computes $t(S)$ from 
$\bar{S} = \partial S \cup \partial S^{+}$, where $\partial S$ and 
$\partial S^{+}$ are found using the adjacency table, which contains no 
spatial data. That is, the internal polygons can be excluded from further 
processing without fetching and examining their spatial descriptions. It is 
not necessary to fetch all polygon identifiers from the database, as the 
adjacency table is a simple relational table and the first two steps in 
algorithm ADJACENCY can benefit from indices on both columns of the table. 
Note that the adjacency table only needs to be built once, independent of $S$. 
However, $\partial S$ and $\partial S^{+}$ have to be built when $S$ is given. 

Like the filter-and-refine approach, which is a standard approach in spatial 
data processing [2], algorithm ADJACENCY uses the adjacency table to perform a 
non-spatial filtering step to reduce the number of spatial objects to be 
processed. Because of the extra cost of constructing $\partial S$ and 
$\partial S^{+}$, Algorithm ADJACENCY is efficient only when 
$|\bar{S}| \ll |S|$. As we will show in Section 4, even though 
$|\partial S^{+}|$ can be many times larger than $|\partial S|$, this 
filtering step is still very effective in improving the overall performance. 



3.2 Using Occupancy Information 

The adjacency-based approach above can be described as object-centric, as it 
focuses on identifying boundary polygons. Now we propose a space-centric 
approach, which decomposes space $\mathcal{D}$ into small irregular regions 
and identifies a set of boundary regions. 



Z-Values. All SDBMSs support one or several spatial data access methods for 
fast retrieval of spatial objects. Spatial access methods have received 
extensive attention in the past from the spatial database research community. 
Represented by the spatial indexing mechanisms based on R-trees, R+-trees and 
z-values, a spatial data access method establishes a certain relationship 
between the data space and spatial objects or their approximations, such as 
minimum bounding rectangles. The z-ordering technique is one of the most 
widely used spatial indexing mechanisms. It approximates a given object’s 
shape by recursively decomposing the embedding data space into smaller 
sub-spaces known as Peano cells. The z-ordering decomposition works as 
follows. The whole space $\mathcal{D}$ (represented as a rectangle) is divided 
into four smaller rectangles of the same size. The four quadrants are numbered 
1 to 4 following a certain order (e.g., the z-order as shown in Figure 4). 
These quadrants can be further divided and numbered recursively. In such a 
way, $\mathcal{D}$ can be decomposed into a set of quadrants, or Peano cells, 
of varying sizes. Each Peano cell has a z-value, which can be determined 
elegantly by bit-interleaving [22]. The following is a simple description of 
one way to assign z-values (see Figure 4): 

Fig. 4. Z-order and object approximation using z-values 

1. The z-value of the initial space, $\mathcal{D}$, is 1; 

2. The z-values of the four quadrants of a Peano cell whose z-value is 
$z = z_1 \ldots z_n$, $1 \le z_i \le 4$, $1 \le i \le n$, are $z1$, $z2$, 
$z3$ and $z4$ respectively, following the z-order. 

Thus, the z-values of the minimum Peano cells containing polygons $p_1$ and 
$p_2$ in Figure 4 are 14 and 1221 respectively. The z-values can have 
different lengths, reflecting different Peano cell sizes. The maximum number 
of decomposition levels, also called the resolution, determines the maximum 
length of z-values. In order to simplify processing, a number of ‘0’s are 
often appended at the end of shorter z-values to make the length of all 
z-values identical (i.e., the maximum length). 

Spatial objects can be approximated using a set of Peano cells. A polygon can 
be assigned the z-value of the minimum Peano cell which fully encloses the 
polygon (so the z-values of polygons $p_1$ and $p_2$ in Figure 4 are 14 and 
1221 respectively). The accuracy of approximation can be improved by assigning 
multiple z-values to a polygon (e.g., $p_1$ in Figure 4 can be approximated by 
three Peano cells whose z-values are 141, 142 and 143). The issue of 
approximating polygons by multiple z-values has been the subject of several 
previous studies. From a mathematical viewpoint, this decomposition is a 
transformation of a two- (or higher) dimensional object into a set of 
one-dimensional points (i.e., the z-values), which can be represented as 
numbers and therefore can be maintained by a ubiquitous one-dimensional access 
method such as the B+-tree. 

Let $c$ and $c'$ be two Peano cells. If $c$ is nested inside $c'$ and the 
z-value of $c'$ has $k$ non-zero digits, then the z-value of $c$ must have at 
least $k$ non-zero digits, and its first $k$ digits are digit-wise identical 
to the first $k$ digits of the z-value of $c'$. In other words, the 
containment relationship among Peano cells can be easily recognized by looking 
at their z-values. This property has a wide range of applications. For 
example, for selecting objects inside a given region, one can first find a set 
$Z$ of z-values of the Peano cells covering the query region. Subsequently, 
the objects inside the query region can be quickly identified, using a 
B+-tree index on z-values for example, by comparing their z-values with the 
z-values in $Z$. 
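
The digit-by-digit assignment and the prefix test for containment can be sketched in Python as below. The quadrant numbering used here (1 upper-left, 2 upper-right, 3 lower-left, 4 lower-right) is an assumption, since the figure that fixes the actual order is not reproduced; the data space is taken to be the unit square and all function names are hypothetical.

```python
def z_value(x, y, levels):
    """z-value of the Peano cell at the given depth that contains (x, y)."""
    z = "1"  # the whole space has z-value 1
    x0, y0, size = 0.0, 0.0, 1.0
    for _ in range(levels):
        size /= 2
        right = x >= x0 + size
        upper = y >= y0 + size
        # Assumed order: 1 upper-left, 2 upper-right, 3 lower-left, 4 lower-right.
        z += str(1 + (1 if right else 0) + (0 if upper else 2))
        if right:
            x0 += size
        if upper:
            y0 += size
    return z


def pad(z, resolution):
    """Append '0's so that every z-value has the maximum length."""
    return z.ljust(resolution + 1, "0")


def contains(z_outer, z_inner):
    """True if the cell z_inner is nested inside (or equal to) z_outer."""
    return z_inner.rstrip("0").startswith(z_outer.rstrip("0"))


if __name__ == "__main__":
    z = z_value(0.9, 0.1, 2)
    print(z, pad("14", 3), contains("1400", pad(z, 3)))
```

Because valid digits are 1 to 4, trailing zeros can only be padding, so stripping them before the prefix comparison is safe.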



Z-Values with Occupancy. Let $C$ be the set of all Peano cells with which $S$ 
polygons overlap. If $c \in C$ is not completely occupied by $S$ polygons, we 
call $c$ a boundary cell. An $S$ polygon overlapping with a boundary cell is 
likely to be a boundary polygon. Now we look at how to use z-values to find 
boundary cells by extending the traditional z-value based spatial indices. 

The spatial indices using z-values associate objects with Peano cells. That 
is, each index entry is of the form $(z, p)$, stating that object $p$ overlaps 
with Peano cell $z$. There is no information about what percentage of the cell 
is occupied by $p$, thus it is not sufficient to determine if a Peano cell is 
completely occupied by a set of objects. Therefore, we extend the index entry 
to the form $(z, p, \alpha)$, where $\alpha$ is the occupancy ratio. Let 
$p \cap z$ be the polygon produced from clipping polygon $p$ by the Peano cell 
$z$, then 

$$\alpha = \frac{\mathrm{area}(p \cap z)}{\mathrm{area}(z)}$$ 

In other words, we record not only which polygons overlap with a Peano cell, 
but also the percentage of the area that each polygon occupies. The structure 
of the spatial indexing mechanisms based on traditional z-values, and the 
algorithms using such indices, need little modification to accommodate this 
additional piece of data, though some more efficient algorithms can be 
designed to take advantage of the occupancy information (e.g., a spatial join 
algorithm). 
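
To make the occupancy ratio concrete, the following Python sketch clips a polygon to an axis-aligned cell and divides the clipped area by the cell area. The clipping routine is the Sutherland-Hodgman method and the area is computed with the shoelace formula; both are choices made for this illustration and are not prescribed by the text. The function names and the representation of a cell as `(xmin, ymin, xmax, ymax)` are assumptions.

```python
def clip_to_rect(poly, xmin, ymin, xmax, ymax):
    """Clip a polygon (list of (x, y) vertices) to an axis-aligned cell."""

    def clip(points, inside, intersect):
        out = []
        for n, cur in enumerate(points):
            prev = points[n - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def x_cut(x):
        return lambda a, b: (x, a[1] + (b[1] - a[1]) * (x - a[0]) / (b[0] - a[0]))

    def y_cut(y):
        return lambda a, b: (a[0] + (b[0] - a[0]) * (y - a[1]) / (b[1] - a[1]), y)

    pts = list(poly)
    for inside, cut in [
        (lambda p: p[0] >= xmin, x_cut(xmin)),
        (lambda p: p[0] <= xmax, x_cut(xmax)),
        (lambda p: p[1] >= ymin, y_cut(ymin)),
        (lambda p: p[1] <= ymax, y_cut(ymax)),
    ]:
        if not pts:
            break
        pts = clip(pts, inside, cut)
    return pts


def area(points):
    """Unsigned polygon area by the shoelace formula."""
    pairs = zip(points, points[1:] + points[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2


def occupancy(poly, cell):
    """Occupancy ratio of polygon poly in cell (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = cell
    return area(clip_to_rect(poly, *cell)) / ((xmax - xmin) * (ymax - ymin))


if __name__ == "__main__":
    print(occupancy([(0, 0), (0, 2), (2, 2), (2, 0)], (1, 1, 3, 3)))
```

In an index of this kind the ratio would be computed once, when an entry is inserted, so that later queries only need to add up stored values.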



Identifying Boundary Peano Cells. With the occupancy ratio for a polygon in a 
Peano cell, a boundary cell with respect to $S$ can be identified simply by 
adding up the occupancy ratios of all $S$ polygons overlapping with the cell. 
On the other hand, we want to ensure that all $S$ polygons which do not 
overlap with any boundary cells are internal polygons and can thus be ignored 
by our amalgamation algorithm. While this is in general true, there is an 
exception when polygon $p$ has a line segment $l$ which coincides with the 
boundary of a Peano cell. One extreme case is a polygon that is of the same 
shape and size as a Peano cell. To solve this problem, we introduce 
zero-occupancy. That is, we use an entry $(z, p, 0)$ if $p$ is adjacent to but 
not inside cell $z$. In such a way, if $p$ is a boundary polygon but cannot be 
recognized through the cells it overlaps, it can be picked up by cell $z$ if 
$z$ is a boundary cell. Now the problem of finding boundary polygons of $S$ 
can be solved by finding all Peano cells which are not fully occupied by $S$ 
polygons. 

The algorithm for finding boundary cells can be complex because Peano cells 
are typically of different sizes, and some Peano cells may be nested inside 
others. Assume that $c$ is nested inside $c'$. If $c$ is an internal cell 
(i.e., fully occupied by $S$ polygons), this fact needs to be propagated to 
its parent cell $c'$ in order to find if $c'$ is also fully occupied. This 
upwards propagation can be done using a controlled-traversal algorithm similar 
to one used in earlier work. However, if a parent cell $c'$ is not fully 
occupied but one of its sub-cells $c$ might be, it is difficult to translate 
the occupancy ratio from $c'$ to $c$. This translation requires polygon 
clipping and re-calculation of the occupancy ratios in $c$ for the polygons 
approximated in $c'$. In addition to requiring a more complex algorithm to 
identify boundary cells, allowing nested cells has another disadvantage. That 
is, it is no longer possible to use simple SQL queries to find boundary cells; 
rather, all spatial index entries need to be pulled out and processed outside 
of the underlying DBMS. 

On the other hand, it is not efficient if there are too many polygons associated
with a Peano cell, because all source polygons associated with a boundary cell
will be processed as possible boundary polygons. Thus, a number (termed the high
watermark (HWM)) is chosen such that a cell is further decomposed into
quadrants when the number of polygons overlapping with the cell exceeds the
high watermark. To avoid nested cells, once a cell is decomposed all polygons
approximated in that cell are re-approximated at the lower level. In other
words, while cells in the spatial index can be of different sizes, there are no nested
cells.

Algorithm OCCUPANCY 

Given a set $S \subseteq P$ of source polygons and a z-value-with-occupancy index
$I = \{e_1, \ldots, e_n\}$ for $P$, where each $e_i$ is of the form $(z, p, a)$, find $t(S)$.

1. (Identify boundary cells) select
$Z = \{e_i.z \mid e_i \in I,\ e_i.p \in S,\ \sum_{e \in I,\ e.p \in S,\ e.z = e_i.z} e.a < 100\%\}$.

2. (Identify $\bar{S}$) $\bar{S} = \emptyset$; add $p$ to $\bar{S}$ if $(z, p, \text{do-not-care}) \in I$ and $z \in Z$.

3. (Fetch $\bar{S}$ polygons and do line-clipping) fetch the polygons whose ids are in $\bar{S}$, and
add the line segments which intersect with at least one cell in $Z$ into $L$.

4. (Remove duplicate line segments) remove identical line segments in $L$ (as in
Algorithm SIMPLE).

5. (Finish) return $L$ as $t(S)$.

Algorithm OCCUPANCY identifies boundary Peano cells and computes the
target polygon using the occupancy information. Note that the first two steps
in this algorithm can be implemented using a single SQL query with a group-by
clause (i.e., “group by z having sum(a) < 100%”). A boundary cell tells
not only which polygons are to be fetched but also which parts of these polygons are
to be used (that is, a line segment l of polygon p which overlaps with a boundary
cell z contributes to the part of t(S) in that cell only if l intersects with z). Based
on this property, Algorithm OCCUPANCY can discard some line segments in
step 3 and subsequently improve the performance of step 4.
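
The first two steps of Algorithm OCCUPANCY can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: it uses an in-memory SQLite table instead of Oracle, stores occupancy as integer percentages, and the names `find_boundary_cells`, `occupancy` and `source` are hypothetical. It also assumes that only source polygons contribute to the sum and are added to the fetch set.

```python
import sqlite3


def find_boundary_cells(index_rows, source_ids):
    """Steps 1 and 2 of Algorithm OCCUPANCY on an in-memory SQLite table.

    index_rows: (z, pid, a) index entries; a is an integer percentage (assumption).
    source_ids: ids of the source polygons S.
    Returns (Z, fetch_ids): the boundary cells and the ids of polygons to fetch.
    """
    index_rows = list(index_rows)
    source = set(source_ids)

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE occupancy (z TEXT, pid INTEGER, a INTEGER)")
    con.executemany("INSERT INTO occupancy VALUES (?, ?, ?)", index_rows)
    con.execute("CREATE TABLE source (pid INTEGER PRIMARY KEY)")
    con.executemany("INSERT INTO source VALUES (?)", [(p,) for p in source])

    # Step 1: cells touched by source polygons whose aggregate occupancy by
    # source polygons is below 100% ("group by z having sum(a) < 100%").
    rows = con.execute(
        "SELECT o.z FROM occupancy o JOIN source s ON s.pid = o.pid "
        "GROUP BY o.z HAVING SUM(o.a) < 100"
    ).fetchall()
    con.close()
    boundary = {z for (z,) in rows}

    # Step 2: source polygons that have an index entry in some boundary cell.
    fetch_ids = {pid for (z, pid, _a) in index_rows
                 if z in boundary and pid in source}
    return boundary, fetch_ids


# Toy index: cell "00" is fully covered by source polygons 1 and 2,
# cell "01" is only 80% covered by source polygons, cell "02" holds no source polygon.
INDEX = [("00", 1, 60), ("00", 2, 40),
         ("01", 2, 50), ("01", 3, 30), ("01", 4, 20),
         ("02", 4, 100)]
Z, FETCH_IDS = find_boundary_cells(INDEX, {1, 2, 3})
```

In this toy index polygon 1 is never fetched, because the only cell it overlaps is fully occupied by source polygons.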

In algorithm OCCUPANCY, the set $\bar{S}$ of fetched polygons is a superset of the set
$\partial S$ of boundary polygons, because an internal polygon
can also overlap with a boundary cell. In general, $\bar{S}$ is likely to contain more internal
polygons if the Peano cells are large. This performance issue will be discussed
in Section 4. Because of these internal polygons, on the other hand, algorithm
OCCUPANCY does not have the problem of producing shadow rings. One can
see this from two aspects:

1. Any line segment which is not in any boundary cell is not part of the target
polygon, and thus can be discarded;

2. Any line segment inside a boundary cell either is part of the target polygon,
or its counterpart from another polygon must also be fetched as this polygon
also overlaps with the boundary cell.



Fuzzy Amalgamation Algorithm OCCUPANCY has a desirable advantage
over the other two algorithms: the definition of boundary cells (and thus of boundary
polygons) can be easily adjusted by the user. Instead of defining the internal
cells as those with 100% aggregate occupancy, one can adjust to a lower threshold
(for example, 95%). This is useful when the user wants to ignore data noise
in polygon amalgamation (e.g., holes smaller than a certain size, caused either
by abnormal or insignificant attribute values, or by inaccuracy
in polygon definitions, known as sliver polygons). If this threshold percentage
is defined relative to a Peano cell, one can simply replace 100% in algorithm
OCCUPANCY with the threshold. In general, however, the user may want to
use a threshold related to the data space V. In this case one needs to translate
this threshold value to each Peano cell by considering the actual size of the cell.
Let d = length(z) be the number of non-zero digits of z-value z. The area of this
cell is $1/2^{2d}$ of that of V. Thus, a threshold of t percent of the total data space is
equivalent to $t \times 2^{2d}$ percent in cell z. Algorithm OCCUPANCY can skip these
insignificant holes easily and safely, with little extra overhead. This type of fuzzy
amalgamation, often found useful for spatial OLAP applications, cannot be
achieved by the other two algorithms without forming those small polygons and
calculating their sizes.
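
The threshold translation can be illustrated with a short sketch. It assumes that z-values are digit strings padded with zeros, that each decomposition level splits a cell into four quadrants (so a cell with d non-zero digits covers 1/2^(2d) of the data space), and that a cell is treated as internal when its unoccupied share does not exceed the translated threshold. The function names are hypothetical.

```python
from fractions import Fraction


def cell_level(z):
    """Number of non-zero digits of a z-value given as a digit string."""
    return sum(1 for ch in z if ch != "0")


def cell_threshold(t_percent, z):
    """Translate t percent of the data space into a percentage of cell z."""
    d = cell_level(z)
    return Fraction(t_percent) * 2 ** (2 * d)


def is_internal(aggregate_percent, t_percent, z):
    """True if the unoccupied part of cell z is within the translated threshold."""
    hole_percent = 100 - Fraction(aggregate_percent)
    return hole_percent <= cell_threshold(t_percent, z)


# A hole threshold of 0.01% of the data space, seen from two cells:
T = Fraction(1, 100)
LEVEL2 = cell_threshold(T, "1200")        # 0.01 * 2**4 = 0.16 percent of the cell
COARSE = is_internal(99, T, "1230")       # 1% hole vs 0.64% allowed
FINE = is_internal(99, T, "1234")         # 1% hole vs 2.56% allowed
```

Exact fractions are used so that sums of percentages compare reliably against the threshold.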



4 Performance Study 

In this section we compare the three polygon amalgamation algorithms discussed 
in this paper. The primary performance index used in this section is the response 
time, which is measured as the elapsed time from when a predicate describing the 
source polygons is submitted to when all the line segments of the target polygon are
found. As pointed out in HOI, the I/O cost-based measure such as the number of 
disk pages accessed is not necessarily a suitable performance indicator because 
the CPU cost can be equally important. Not measured are those once-off costs, 
including pre-processing of original polygons (to comply with the polygon data 
structures in Section 2), building the adjacency and occupancy tables and other 
necessary indices. 



4.1 Cost Analysis 

From a set $S \subseteq P$ of source polygons, an amalgamation algorithm takes three
phases to produce the target polygon t(S):

1. the query phase which identifies a subset $\bar{S} \subseteq S$ (where $\bar{S}$ is represented as
a set of polygon ids);

2. the fetch phase which retrieves the polygons whose ids are in $\bar{S}$, and unfolds
polygons from a sequence of points into an array L of line segments; and

3. the merger phase which computes the target polygon by removing duplicate
line segments in L.

Let $C_{query}$, $C_{fetch}$ and $C_{merge}$ be the response times for the three phases respectively.
The response time for an amalgamation algorithm, $C_{total}$, is the sum of
these three components. That is:

$$C_{total} = C_{query} + C_{fetch} + C_{merge}$$

Two factors need to be considered for $C_{query}$. First, these three algorithms find
a fetch set $\bar{S}$ with different numbers of polygons. A larger $|\bar{S}|$ may affect $C_{query}$ as well as
$C_{fetch}$ (since more polygon ids and polygons need to be fetched). Second, these
three algorithms use methods of different complexity to select $\bar{S}$. Algorithm
SIMPLE simply uses $\bar{S} = S$, thus it uses a straightforward selection query for
this step. Algorithm ADJACENCY selects $\bar{S}$ in two steps: one to identify the
boundary polygons and one to identify the sub-boundary polygons. Algorithm
OCCUPANCY also uses two steps: identification of the boundary cells using a
query with an aggregate function, followed by a set of queries to retrieve the ids of
the source polygons overlapping with the boundary cells. For both algorithms,
it is possible to combine these two steps into one SQL query. However, such a
complex query is far less efficient to execute than two separate queries
issued for the two steps from an application program. For algorithm OCCUPANCY, both
the final size of $\bar{S}$ and the cost for identifying $\bar{S}$ vary with HWM.

All three algorithms retrieve the polygons whose ids are in the fetch set $\bar{S}$. The cost
for fetching objects (i.e., $C_{fetch}$) depends not only on how many polygons are to
be fetched but also on the sizes of these polygons. During this phase, algorithm
ADJACENCY tags each line segment with its source (i.e., from a boundary
or sub-boundary polygon). Algorithm OCCUPANCY clips line segments
against boundary Peano cells such that only the line segments which intersect
with at least one boundary Peano cell are kept for the next phase. With larger
Peano cells (because of a higher HWM), more internal polygons are included in $\bar{S}$
and fewer line segments can be dropped by clipping. The middle point of each
line segment is calculated in this phase for all three algorithms.

The line segments are sorted by their middle points and the line segments that 
appear more than once are removed. Algorithm ADJACENCY needs to have an 
additional step to discard all line segments from the sub-boundary polygons. 
The remaining line segments form the target polygon. We do not include the 
time to order the line segments for the target polygon in Cmerge as this cost is 
identical across all the three algorithms. 
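
The duplicate removal in the merger phase can be sketched as below. This is a minimal illustration, not the code used in the experiments: it assumes segments are given as pairs of integer endpoints, that a shared edge appears once in each adjacent polygon (possibly with opposite orientation), and it sorts by twice the middle point to stay in exact arithmetic. The name `remove_duplicate_segments` is hypothetical.

```python
from itertools import groupby


def remove_duplicate_segments(segments):
    """Keep only the line segments that occur exactly once.

    segments: iterable of ((x1, y1), (x2, y2)) with exact (e.g. integer) coordinates.
    Segments occurring more than once are interior edges and are dropped entirely.
    """
    def key(seg):
        a, b = sorted(seg)                        # orientation-independent endpoints
        doubled_mid = (a[0] + b[0], a[1] + b[1])  # twice the middle point
        return (doubled_mid, a, b)

    outline = []
    # Sorting by middle point puts identical segments next to each other.
    for _, group in groupby(sorted(segments, key=key), key=key):
        group = list(group)
        if len(group) == 1:
            outline.append(group[0])
    return outline


# Two unit squares sharing the edge from (1, 0) to (1, 1).
SQUARE_A = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1)), ((0, 1), (0, 0))]
SQUARE_B = [((1, 0), (2, 0)), ((2, 0), (2, 1)), ((2, 1), (1, 1)), ((1, 1), (1, 0))]
OUTLINE = remove_duplicate_segments(SQUARE_A + SQUARE_B)
```

The six remaining segments form the outline of the merged rectangle; ordering them into a ring is a separate step, which the text above excludes from the merge cost.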

4.2 Databases and Parameters 

Each polygon is stored as one object in the database (i.e., we do not consider 
object decomposition), whose schema is: 

POLYGON(pid : INTEGER, boundary : POLYGON)

where data type POLYGON, which is essentially a sequence of points, can have
different implementations according to different underlying DBMS. We use Or- 
acle 8 in our tests (its object-relational features are not used, nor is the Spatial 
Data Cartridge, so POLYGON is simply implemented as a BLOB). Other at- 
tributes for a polygon are stored in a separate table and are linked to table 
POLYGON through pid. 

A subset of the TIGER/LINE data (census blocks in California) is used for 
our performance testing (see http://www.census.gov/ftp/pub/geo/www/tiger/). 
A census block has an attribute county id, which is used for grouping source poly- 
gons in our experiments. A county consists of from 9 to 6,022 polygons. There 
are 21,648 polygons with 1,618,950 points in total. The number of points in a 
polygon ranges from 4 to 3,846, with an average of 75 points. We merge census 
blocks into counties, and adjacent counties into larger polygons. The preprocessing
for splitting line segments at joint points and resolving data inconsistency
problems is done using the GIS package ARC/INFO.

The adjacency table has the schema: 

ADJACENCY(pid : INTEGER, next_to : INTEGER)

If polygons p and p' are adjacent to each other, we choose to store both (p, p')
and (p', p) in table ADJACENCY. The table thus has twice as many tuples
as necessary, but the query to identify adjacent polygons is simpler and runs
faster. The adjacency table for the data set we used has 137,978 rows. Both the
boundary polygons and sub-boundary polygons for algorithm ADJACENCY are
identified using this table, in two steps as mentioned before.

The occupancy information is recorded as 

OCCUPANCY(z : INTEGER, pid : INTEGER, occupancy : NUMBER)

Since all we need to find here is whether a cell is fully occupied by a set of 
polygons or not, we do not compute exact percentage of a polygon in a cell. 
Instead, we count the number of polygons in each cell and simply give each 
polygon an equal share of occupancy. For example, if there are 8 polygons in a cell,
the occupancy for each polygon in that cell is 1/8 = 0.125, regardless of their actual
occupancy ratios. When the total occupancy for a given set of source polygons 
in the cell is 1, we know that the cell is not a boundary cell. (However, an 
accurate calculation of occupancy ratio is necessary in order to support fuzzy 
amalgamation.) The number of rows in table OCCUPANCY, as shown in Table 1, 
varies depending on HWM. 
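
The equal-share simplification can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, the input is assumed to be a list of (z, pid) overlap pairs, and exact fractions are used instead of a NUMBER column so that shares such as 1/3 sum to exactly 1.

```python
from collections import defaultdict
from fractions import Fraction


def equal_share_occupancy(overlaps):
    """Build (z, pid, share) rows, giving each polygon in a cell an equal share.

    overlaps: iterable of (z, pid) pairs, one per polygon overlapping cell z.
    """
    by_cell = defaultdict(set)
    for z, pid in overlaps:
        by_cell[z].add(pid)
    return [(z, pid, Fraction(1, len(pids)))
            for z, pids in sorted(by_cell.items())
            for pid in sorted(pids)]


def is_boundary_cell(rows, z, source_ids):
    """A cell touched by source polygons is a boundary cell if their shares sum to less than 1."""
    total = sum(a for (cell, pid, a) in rows if cell == z and pid in source_ids)
    return 0 < total < 1


# Eight polygons overlap cell "01", so each gets a share of 1/8.
ROWS = equal_share_occupancy([("01", pid) for pid in range(1, 9)])
ALL_SELECTED = is_boundary_cell(ROWS, "01", set(range(1, 9)))   # all 8 are sources
ONE_MISSING = is_boundary_cell(ROWS, "01", set(range(1, 8)))    # polygon 8 is not
```

As noted above, these shares only answer whether a cell is fully occupied; they are not suitable for fuzzy amalgamation, which needs the actual occupancy ratios.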

4.3 Experimental Results 

Now we compare the performance of the three amalgamation algorithms empirically.
For algorithm OCCUPANCY, we also test with different HWMs. The
algorithms are implemented using Microsoft Visual C++ and Oracle OCI interfaces.
Both development and testing are done using a DELL notebook (Pentium
II/266) with 128 MB memory. Indexes are created wherever necessary for all the
tables used. Oracle’s array fetch function is used whenever possible to improve
the performance of data retrieval.

HWM (bytes)   Average num. of polygons per cell   Num. of rows in table OCCUPANCY
512           4.6                                 99638
1024          7.1                                 65405
2048          11.4                                48168
4096          18.9                                38336

Table 1. High water marks (HWM) for the occupancy table.

Fig. 5. Number of spatial objects processed (x-axis = |S|/50)



Data Size Reduction First, we look at the effectiveness of these algorithms
in reducing the number of polygons to be processed. Figure 5(a) shows the
number of polygons actually fetched by different algorithms, where the x-axis in
Figure 5 (and Figure 6) is the number of source polygons to be amalgamated
(ranging from 17 to 21,528). Obviously, the maximum and minimum numbers of
polygons to be fetched by any amalgamation algorithm are $|S|$ (i.e., all source
polygons) and $|\partial S|$ (i.e., only the boundary polygons) respectively. These two
numbers are labeled as ‘source’ and ‘target’ respectively in Figure 5.

Figure 5(a) reveals three facts: (1) in comparison with algorithm SIMPLE
which retrieves all the source polygons, both adjacency-based and occupancy-based
algorithms fetch a much smaller number of polygons; (2) the performance
of algorithms ADJACENCY and OCCUPANCY scales well when the number
of source polygons increases (note that the number of polygons to be fetched
depends not only on the number of source polygons but also on the complexity
of the target polygon, such as its shape and whether it has holes); and (3)
the differences among the adjacency-based algorithm and the occupancy-based
algorithms with different HWMs are significant (these differences
may look deceptively small in Figure 5(a) because of the much bigger differences between them
and algorithm SIMPLE). The occupancy algorithm with HWM = 4096 performs
consistently worse than the adjacency algorithm, which needs to retrieve about
50% more polygons than the occupancy algorithm with HWM = 512. For the
occupancy algorithm, when HWM increases, more polygons overlap with a Peano
cell; thus, it is more likely to fetch internal polygons. The implication of different
HWMs on response times will be examined later.

For algorithm ADJACENCY, the number of sub-boundary polygons can be derived as the
difference between the actual number of polygons fetched by the algorithm and $|\partial S|$.
One can see that there are about twice as many sub-boundary polygons as boundary polygons.

Figure 5(b) shows the total number of lines to be processed. Here the maximum
and minimum numbers of line segments to be processed by any amalgamation
algorithm are the total number of line segments in all source polygons
(i.e., $\|S\|$), which is what algorithm SIMPLE has to process, and the number of
line segments in the final target polygon (i.e., $\|t(S)\|$). The occupancy algorithm
discards those line segments not overlapping with any boundary Peano cell. As
a result, the number of line segments processed by the algorithm (after clipping)
is much smaller than that processed by the adjacency algorithm, which in turn processes
far fewer line segments than the simple algorithm.



Response Times Figure 6 shows the response time for each phase as well as the
total elapsed time. The query time is the elapsed time from when the predicate
describing source polygons is submitted to when the polygons to be fetched are
identified. Three factors may affect the query time. The first is the size of the table
used to produce the fetch set $\bar{S}$ (i.e., the size of the adjacency and occupancy tables for
algorithms ADJACENCY and OCCUPANCY respectively). The second is $|\bar{S}|$ itself.
The third is the complexity of the query used for identifying candidate polygons. We
avoid using an inefficient join query for the adjacency and occupancy algorithms
(using two passes as mentioned in Section 4.1); thus the complexity of the queries
used for these three algorithms is similar. Figure 6(a) illustrates clearly that
the response time for the query phase is primarily determined by the first factor.
$|\bar{S}|$ is insignificant because of the use of array fetch. Algorithm SIMPLE is the
fastest since it uses only a simple selection query on the base table that has a
smaller number of rows than the adjacency or occupancy tables. The adjacency
algorithm is the slowest, even when it results in a smaller $|\bar{S}|$ in comparison with
the occupancy algorithm with HWM=4096. For the occupancy algorithm, a
smaller HWM results in a larger occupancy table (see Table 1) due to a higher 
probability for one polygon overlapping with many cells m- 

The time for fetching polygons from the database, as shown in Figure 6(b),
clearly dominates the whole polygon amalgamation process. Reducing this time
is our main motivation in this paper. We have achieved this goal by
reducing this time by approximately 80% in comparison with algorithm SIMPLE.
A clear winner is the occupancy algorithm with HWM=512, which is the most
selective one as shown in Figure 5(a).

Fig. 6. Response time (seconds) (x-axis = |S|/50): (a) query time, (b) fetch time,
(c) merge time, (d) total time

The biggest time reduction in terms of percentage has been achieved for the
merger phase. The cost of this phase, obviously, is determined by the number
of lines to be processed (compare Figure 6(c) with Figure 5(b)). The occupancy
algorithm, with all three HWMs, performs significantly better because
fewer polygons are fetched and, more importantly, because line segments are
discarded against boundary cells at the end of the fetch step.

Figure 6(d) shows the total response time. It is clear that the occupancy
algorithm has achieved a remarkably better overall performance, in particular
with an HWM which puts an average of 5 to 7 polygons in a Peano cell, that
is, HWM = 512 or 1024 for the data set we use. A higher HWM degrades the
performance because it becomes inefficient in filtering out internal polygons and
line segments. On the other hand, an HWM that is too low (i.e., less than 512 for the data
set we used) increases the query cost with little benefit to other steps.

Finally, we briefly discuss the memory requirement for the amalgamation 
algorithms. Algorithm SIMPLE is the hungriest in terms of memory require- 
ment among the three algorithms, as it needs to hold all line segments in the 
memory. Algorithm ADJACENCY consumes much less memory, only for hold- 
ing line segments from the boundary and sub-boundary polygons. Algorithm 
OCCUPANCY needs the least amount of memory, as it stores only part of line 
segments for the polygons overlapping with boundary Peano cells. We assume in 
this paper that the memory is large enough for holding all line segments to be 
processed in memory. This assumption might become unrealistic when there are 
a large number of polygons to be processed (which is common in spatial OLAP 
and spatial data mining applications). However, based on the fact that only the
polygons adjacent to each other are to be processed together for the purpose of
removing duplicate line segments, algorithms from spatial databases (such
as the plane-sweep algorithm) or from relational databases for handling
similar problems (such as hybrid hashing) can be used when the memory
is not big enough.



5 Conclusions 

With emerging new applications such as spatial OLAP and spatial data mining,
certain spatial operations such as polygon amalgamation have become increasingly
popular and their efficient implementation becomes crucial in the realization
of new spatial applications. In this paper, we have studied efficient algorithms for
polygon amalgamation. This operation is intrinsically time-consuming. However,
with the observation that only boundary polygons play a crucial role in
polygon amalgamation, a set of algorithms has been proposed and
studied in this paper. Starting from improving a simplistic polygon amalgamation
algorithm, we have proposed two methods, adjacency-based and occupancy-based,
which exclude a large subset of polygons from being considered in the
amalgamation algorithm without retrieving the spatial description of these polygons.
The performance of these algorithms has been compared using real spatial
data sets. With the support of a more sophisticated data storage structure,
the occupancy-based method outperforms the adjacency-based method, whereas
both methods are significantly more efficient than the algorithm which requires
fetching all objects to be merged.

The performance of the occupancy-based algorithm can be further improved 
by decomposing spatial objects. As implemented in some SDBMSs, a spatial 
object can be decomposed with the Peano cells approximating the object. In 
such a case, the occupancy-based algorithm only needs to fetch the parts of a 
spatial object that are inside a boundary cell, instead of the whole object. Such 
object decomposition can be done off-line when a spatial index is built. The on- 
line processing performance for the occupancy-based algorithm can be improved 
greatly as the amount of data to be retrieved is reduced and there is no need to 
do polygon clipping on-the-fly. Our work on this improvement will be reported
in a separate paper. In the future we also plan to integrate our new polygon 
amalgamation algorithms with the research results in selective materialization 
for data cube construction m for supporting spatial OLAP and spatial data 
mining applications. 
Abstract. Spatial data mining has recently emerged from a number of real
applications, such as real-estate marketing, urban planning, weather forecasting,
medical image analysis, road traffic accident analysis, etc. It demands
efficient solutions for many new, expensive, and complicated
problems. In this paper, we investigate a proximity matching problem
among clusters and features. The investigation involves proximity relationship
measurement between clusters and features. We measure proximity
in an average fashion to address possible nonuniform data distribution
in a cluster. An efficient algorithm for solving the problem is
proposed and evaluated. The algorithm applies a standard multi-step
paradigm in combination with novel lower and upper proximity bounds.
The algorithm is implemented in several different modes. Our experimental
results not only give a comparison among them but also illustrate the
efficiency of the algorithm.

Keywords: Spatial query processing and data mining. 



1 Introduction 



Spatial data mining is to discover and understand non-trivial, implicit, and pre- 
viously unknown knowledge in large spatial databases. It has a wide range of ap- 
plications, such as demographic analysis, weather pattern analysis, urban plan- 
ning, transportation management, etc. While processing of typical spatial queries 
(such as joins, nearest neighbour search, KNN, and map overlays) has received
a great deal of attention for years [2ldl412H| . spatial data mining, viewed as ad- 
vanced spatial queries, demands for efficient solutions for many newly proposed, 
expensive, complicated, and sometimes ad-hoc spatial queries. 

Inspired by a success in advanced spatial query processing techniques pnEj . 
pi f|f 2f2Sj. relational data mining mm, machine learning , compu- 
tational geometry and statistics analysis many research results and 

system prototypes in spatial data mining have been recently reported 
|lMl8ll9rill'.l4| . The existing research does not only tend to provide system solutions
but also covers quite a number of special purpose solutions to ad-hoc
mining tasks. These include efficiently computing spatial association rules 1101 , 
spatial data classification and generalization |1 Ml 5121 f24) , spatial prediction and 
trend analysis | 01 , clustering and cluster analysis [51711 8I2ItI.S‘2| . mining in image 
and raster databases etc. 

Clustering has been proven one of the most useful tools to partition and 
categorize spatial data into clusters for the purpose of knowledge discovery. A 
number of efficient algorithms [5171251, SHS2| have been proposed. Consider that 
the clustering technique might be too expensive to apply to approaching ad-hoc 
spatial data mining tasks. In ITT^ . special purpose mining algorithms have 
been developed, as alternatives to clustering, for solving two ad-hoc problems. 
The first problem is to find the k closest features surrounding a set of points in 
two dimensional space. Such a set of points may be either a cluster obtained by 
a clustering algorithm or an existing spatial object (e.g. a residential area) in the 
database, while a feature is a polygon. The second problem in m is to compute 
the commonality among n sets of points (e.g. n residential areas), provided that 
their k closest features are pre-computed. 

To complement the research in I18I19I . in this paper we study the following 
ad-hoc proximity matching problem (PM) among n sets of points: 



Suppose that in two dimensional space, n sets of points and m sets of
polygons are given. With regard to a specific set $C_0$ of points, find the
sets C of points that meet the proximity matching
condition: the “shortest distance” from C to each set $\pi$ of polygons is
not greater than that between $C_0$ and $\pi$. Further, if no set of points meets
the proximity matching condition then the “best” approximate solution
is computed.



The PM problem has a number of useful real applications. For instance, in real-estate
spatial data, a set of points represents a residential area where each point
represents a land parcel; a polygon corresponds to a vector representation of a
feature, such as a lake, golf course, school, motor way, etc. In this application,
a house buyer or a real-estate developer may want to purchase a property in a
well-known area C because of its proximity relationships to certain surrounding
features but may not be able to do so, either because no property is available in C or
because of a budget limit. Therefore, the purchaser has to choose instead among the available
and affordable residential areas most similar to C with respect to these proximity
relationships. Other applications include road traffic accident investigation,
criminal analysis, etc.

PM will be formally defined in the next section. In PM, we assume that the
“shortest distance” between a set of points and a set of polygons has not been
pre-computed, nor stored in the database. Further, such a shortest distance will
be defined in average sense to reflect non-uniform data distribution. These dif- 
ferentiate PM with KNN [211 the problem of searching commonalities among 
n sets of points m, and incremental distance join problem m- 
A naive way to solve PM is to first precisely compute the distance information
between each set of points and each set of polygons, and then to solve PM.
However, in practice there may be many sets of points far from being part of 
solution to PM. Motivated by this, our algorithm adopts a standard multi-step 
technique | |2I4I1 8r2fl| in combining with novel and powerful pruning conditions to 
filter out uninvolved features and sets of points. The algorithm has been imple- 
mented in several different modes for performance evaluation. Our experiments 
clearly demonstrate the efficiency of the algorithm.

The rest of the paper is organized as follows. In section 2, we present a precise 
definition of PM as well as a brief description of the adopted spatial database architecture. Section
3 presents our algorithm for solving PM. Due to the length limitation, in this 
paper we sketch only the proofs of our theoretical results, and the interested 
readers may refer to our full paper m for the detailed proof. Section 4 reports 
our experiment results. In section 5, a discussion is presented regarding various 
modifications of our algorithm. This is followed by the conclusions and remarks. 



2 Preliminary 

In this section we precisely define the PM problem. A feature A is a simple 
and closed polygon EH in the 2-dimensional space. A set C of points in the 
two dimensional space is called cluster for notation simplicity. Following ESI, 
we assume that in PM a cluster is always outside EH a feature. Note that this 
assumption may support many real applications. For instance, in real-estate 
data, a cluster represents a set of land parcels, and a feature represents a man-made
or natural place of interest, such as a lake, shopping center, school, park,
entertainment center, etc. Such data can be found in many electronic maps in a 
digital library. 

To efficiently access large spatial data (usually tera-bytes), in this paper we 
adopt an extended-relational and a SAND (spatial-and-non-spatial database) 
architecture |3|. That is, a spatial database consists of a set of spatial objects 
and a relational database describing non-spatial properties of these objects. For 
instance, a set of electronic data describing Sydney metropolitan area may be 
organized as follows. 

— SUBURB (name, #houses, #units, average_price, ..., g-des), 

— GOLF COURSE (name, #holes, ..., g-des), 

— SCHOOL (name, type, ... , g-des), 

— BEACH (name, type, ..., g-des). 

In the above database schemata, the attribute g-des represents a spatial 
object, which is either a set of points or a polygon in PM. In order to achieve 
efficient access, in SAND the attribute g-des stores only a pointer in the relational 
table, pointing to the actual spatial object description. An example of PM is
shown below:
Example 1. select s.*
    from SUBURB s, SUBURB s1, GOLF COURSE g, BEACH w, SCHOOL sc
    where s1.name = 'Randwick' and s.name ≠ 'Randwick' and
          s.average_price < 400,000 and g.#holes = 9 and sc.type = 'private'
    proximity-matching between (s.obj, s1.obj) regarding
          their shortest distances to g, sc, and w □

Example 1 finds the suburbs with an average house price of less than
$400,000, such that their individual shortest distances to the golf courses with
9 holes, to private schools, and to beaches are respectively smaller than those
between the suburb Randwick and these features. If no such suburbs exist,
then the suburbs that most closely meet the proximity matching conditions
will be reported.

Taking the above query as an example, we now formally define PM. In PM, 
the input consists of: 

- a cluster $C_0$ (e.g. the suburb Randwick in Example 1),

- a set S of clusters (e.g. the suburbs with average_price not greater than
400,000 dollars in Example 1),

- a set $\Pi = \{\pi_j : 1 \le j \le m\}$ of groups of features (e.g. in Example 1, m = 3,
$\pi_1$ is the set of golf courses with 9 holes, $\pi_2$ is the set of private schools, and
$\pi_3$ is the set of beaches).

Given a feature F and a point p outside F, the length of the actual (walking
or driving) shortest path from p to F is too expensive to compute in the presence
of tens of thousands of different roads. In PM, we use the shortest Euclidean
distance from p to a point on the boundary of F, denoted by d(p, F), to reflect the
geographic proximity relationship between p and F. We believe that on average,
the length of an actual shortest path can be reflected by d(p, F). We call d(p, F)
the distance between p and F. Note that if F degenerates to a point p' then
d(p, p') means the Euclidean distance between them; and F may also degenerate
to a line. Moreover, for the purpose of computing lower and upper proximity
bounds in Section 3, we need to extend the definition of d(p, F) to cover the case
when p is inside or on the boundary of F; that is, d(p, F) = 0 if p is inside F.

A proximity value between a cluster C and a feature F can be defined in a 
number of ways. We may define it by the shortest distance between the “boundary”
of C and the boundary of F. However, as points in C admit an arbitrary
distribution, such a proximity value may not be the majority consensus from C; 
this was shown in m- We use the following average proximity value to quanti- 
tatively model the proximity relationship between P and C: 

AP(C,P) = ^^d(p,P) (1) 

' ' pGC 

In PM, a set $\pi$ of features normally means a set of features of the same 
kind. We define the distance between a set $\pi$ of features and a cluster 
C to be the smallest average proximity between C and a feature $F \in \pi$, and denote it 
by $D(C, \pi)$. That is, 

$$D(C, \pi) = \min_{F \in \pi} \{AP(C, F)\} \qquad (2)$$
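Equations (1) and (2) translate almost directly into code. The sketch below assumes a cluster is a list of points and takes the point-to-feature distance as a parameter; the names `average_proximity`, `group_distance`, and `point_dist` are illustrative, and the demo uses the degenerate case in which every feature is a single point.

```python
import math

def average_proximity(cluster, feature, dist):
    """AP(C, F): mean of dist(p, F) over the points p of cluster C (Equation 1)."""
    return sum(dist(p, feature) for p in cluster) / len(cluster)

def group_distance(cluster, group, dist):
    """D(C, pi): smallest AP(C, F) over the features F in the group pi (Equation 2)."""
    return min(average_proximity(cluster, f, dist) for f in group)

def point_dist(p, f):
    """Degenerate features: F is a single point, so d(p, F) is Euclidean."""
    return math.dist(p, f)

cluster = [(0.0, 0.0), (2.0, 0.0)]
schools = [(1.0, 0.0), (10.0, 0.0)]
print(group_distance(cluster, schools, point_dist))
```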

As mentioned earlier, if no cluster in PM meets the above requirements, then 
the proximity matching needs to find the clusters that come closest to meeting them; 
in this case, we rank the importance of a set $\pi_j$ of features by a positive value 
$w_j$. The more important a feature group $\pi_j$ is, the larger $w_j$ is. The $w_j$ can be assigned 
either by a user or by a system default. Therefore, a set $\{w_j \mid 1 \le j \le m\}$ of 
positive values is also part of the input of PM. PM can now be modeled as 
finding the clusters C in S such that the following goal function is minimized. 



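$$PM_{C_0}(C, \Pi) = \sum_{j=1}^{m} w_j \, pm(D(C, \pi_j), D(C_0, \pi_j)) \qquad (3)$$

where 

$$pm(D(C, \pi_j), D(C_0, \pi_j)) = \begin{cases} 0 & \text{if } D(C, \pi_j) \le D(C_0, \pi_j) \\ D(C, \pi_j) - D(C_0, \pi_j) & \text{otherwise} \end{cases} \qquad (4)$$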
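Once the group distances are known, the goal function is a weighted sum of one-sided penalties. The following sketch assumes the distances D have already been computed and are passed as plain lists indexed by feature group; the function names and the cluster labels in the usage note are hypothetical.

```python
def pm(d_c, d_c0):
    """Penalty for one feature group: zero when C is at least as close as C0."""
    return 0.0 if d_c <= d_c0 else d_c - d_c0

def pm_goal(d_cluster, d_reference, weights):
    """Goal function: weighted sum of the penalties over the m feature groups."""
    return sum(w * pm(dc, d0)
               for w, dc, d0 in zip(weights, d_cluster, d_reference))

def best_clusters(d_by_cluster, d_reference, weights):
    """Return the names of the clusters that minimize the goal function, and all scores."""
    scores = {name: pm_goal(ds, d_reference, weights)
              for name, ds in d_by_cluster.items()}
    best = min(scores.values())
    return sorted(name for name, s in scores.items() if s == best), scores
```

For example, with reference distances `[2.0, 3.0, 5.0]`, weights `[1.0, 2.0, 1.0]`, and two candidate clusters with distances `[1.5, 3.5, 4.0]` and `[2.5, 2.0, 6.0]`, the first cluster scores 1.0 and the second 1.5, so the first is reported.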



3 Algorithms for Solving PM 

In this section, we present an efficient algorithm for solving PM. The algorithm 
is denoted by CPM, which stands for Computing the Proximity Matching. 

An immediate (brute-force) way to solve PM is to 1) compute AP(C, F) 
for each pair of a cluster C and a feature F, 2) compute $D(C, \pi_j)$ 
for every pair of a cluster C and a group $\pi_j$, 3) compute $PM_{C_0}(C, \Pi)$ for each cluster C 
in S, and 4) find the clusters C with the smallest values of $PM_{C_0}(C, \Pi)$. 
Note that AP(C, F) can be easily computed in O(|C||F|) time according to its 
definition, and this is the dominant cost. Though the brute-force approach 
runs in quadratic time with respect to the input size, there may be hundreds of clusters 
and tens of thousands of features involved in the computation. Moreover, each 
cluster (feature) may have a number of points (edges). This makes the brute-force 
approach computationally prohibitive in practice, and our experiment results in 
Section 4 confirm this. 

An alternative way to approach PM is to adopt a multi-step paradigm [?, ?]. 
That is, we first apply a coarse and fast computation. Instead of 
computing the actual value of AP(C, F) in quadratic time, we may compute a lower 
bound and an upper bound for AP(C, F) in constant time O(1). By these 
bounds, for each cluster C in S we can rule out the features in a $\pi_j$ that are 
definitely not closest to C; thus we do not have to precisely compute the 
average proximity values between these eliminated features and C. Secondly, we 
can deduce a lower bound and an upper bound for each $PM_{C_0}(C, \Pi)$ from the 



Efficiently Matching Proximity Relationships in Spatial Databases 193 



bounds of AP, and then filter out uninvolved clusters. This is the basic idea of 
our algorithm. We have not integrated CPM into 
a particular spatial index, such as R-trees or R+-trees, for the following 
reasons: 

— No such spatial index may have been built. 

— The PM problem may involve many features from different tables/electronic 
thematic maps; thus, the spatial indices built for the thematic maps may be 
different. This makes it more difficult to use spatial indices. 

— A feature or a cluster that qualifies in PM may be only a part of a 
stored spatial object; for instance, a user may be interested in only a certain 
part of a residential area. This makes a possible existing index based on the 
stored spatial objects not applicable. 

— The paper [?] indicates that existing spatial indexing techniques do not 
necessarily support the computation of aggregate distances well; the same argument 
applies to average distance computation. 

The algorithm CPM consists of the following 5 steps: 

Step 1: Read the relevant clusters into buffer. 

Step 2: Read features batch by batch into buffer and determine their groups by 
validating the selection conditions against the relational tables. 

Step 3: For each feature F in a $\pi_j$, compute lower and upper bounds of AP(C, F) 
for a cluster C. Then determine whether or not F should be kept for the 
computation of $D(C, \pi_j)$. 

Step 4: For each $\pi_j$ and each cluster C, compute lower and upper bounds for 
$D(C, \pi_j)$, and then derive lower and upper bounds for $PM_{C_0}(C, \Pi)$. Filter out clusters 
which will not be part of the solution to PM. 

Step 5: Apply the above brute-force method to the remaining clusters and their 
associated features to solve PM. 

In the next several subsections we detail the algorithm step by step. Clearly, 
a success of the algorithm CPM largely relies on how good the lower and upper 
bounds of AP are. The goodness of lower and upper bounds means two things: 
1) the bounds should be reasonably tight, and 2) the corresponding computation 
should be fast. We first present the lower and upper bounds. 

Note that for presentation simplicity, the algorithms presented in the paper 
are restricted to the case when features and clusters qualified in PM are stored 
spatial objects in the database. However, they can be immediately extended to 
cover the case when a feature or a cluster is a part of a stored object. 

3.1 Lower and Upper Bounds for Average Proximity 

In this subsection, we first recall some useful notation. The barycenter (centroid) 
of a cluster C is denoted by b(C). A convex [23] polygon encompassing a feature 
F is called a bounding convex polygon of F. The smallest bounding convex 
polygon of F is called the convex hull [?] of F and is denoted by $P_F$. An 
isothetic rectangle is a rectangle whose edges are parallel to the coordinate axes. The minimum bounding 
rectangle of F refers to the minimum isothetic bounding rectangle of F and is 
denoted by $R_F$. Similarly, we denote the convex hull of a cluster C by $P_C$ and 
the minimum bounding rectangle of C by $R_C$. The bounds presented 
in this subsection are based on either minimum bounding rectangles or convex 
hulls. 

Given an $R_C$ and an $R_F$, an immediate idea is to use the shortest distance 
and the longest distance between $R_C$ and $R_F$ to represent, respectively, a lower 
bound and an upper bound of AP(C, F). However, this immediate idea has two 
problems. The first problem is that when the two rectangles intersect each 
other (note that in this case C and F do not necessarily intersect), 
the shortest and longest distances between $R_C$ and $R_F$ are not well defined. 
The second problem is that the bounds may not be very tight even if the two 
rectangles do not intersect. The same problems arise for convex hulls. Below, 
we present new and tighter bounds. 

Our lower bound computation is based on the following Lemma. 

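Lemma 1. For any real numbers $x_i$ and $y_i$ ($1 \le i \le K$), 
$\sum_{i=1}^{K} \sqrt{x_i^2 + y_i^2} \ge \sqrt{\left(\sum_{i=1}^{K} x_i\right)^2 + \left(\sum_{i=1}^{K} y_i\right)^2}$. 

Proof: It can be immediately verified that the inequality holds when K = 2. By 
mathematical induction on K, we can prove the Lemma. □ 

From Lemma 1, Theorem 2 immediately follows. 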

Theorem 2. Suppose that C is a cluster, F is a feature, and P is either the convex 
hull or the minimum bounding rectangle of F. Then, $AP(C, F) \ge d(b(C), P)$; 
in other words, d(b(C), P) is a lower bound of AP(C, F). 
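The rectangle variant of this lower bound needs only the barycenter of the cluster and the minimum bounding rectangle of the feature. The reasoning is that each cluster point's nearest point of F lies in P, the average of those nearest points also lies in P because P is convex, and the average distance is at least the distance between the two averages. The sketch below assumes clusters and feature vertices are lists of (x, y) tuples; all function names are illustrative.

```python
import math

def barycenter(cluster):
    """b(C): the centroid of the points of cluster C."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def mbr(points):
    """Minimum isothetic bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def point_rect_distance(p, rect):
    """d(p, R): 0 if p lies in the rectangle R, else the distance to its boundary."""
    (px, py), (xmin, ymin, xmax, ymax) = p, rect
    dx = max(xmin - px, 0.0, px - xmax)
    dy = max(ymin - py, 0.0, py - ymax)
    return math.hypot(dx, dy)

def ap_lower_bound(cluster, feature_vertices):
    """Lower bound of AP(C, F): d(b(C), R_F), computed without scanning C against F."""
    return point_rect_distance(barycenter(cluster), mbr(feature_vertices))
```

If b(C) and the rectangle are precomputed once per cluster and per feature, each pairwise bound costs constant time.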



Figure 1 gives an example, and shows that our lower bound is tighter than 
the shortest distance between two rectangles. 




The $R_F$ of a feature F has four edges: the left boundary, right boundary, 
bottom boundary, and top boundary. Note that each boundary edge x is divided 
into several line segments (at least two). The two end points of such a line 
segment are either a) a pair of adjacent intersection points between F and 
$R_F$, or b) a vertex of $R_F$ and an intersection point between F and $R_F$. We use 
$H_{R_F,x}$ to denote the maximal segment length among these line segments in the 
boundary edge x, where $x \in \{l, r, b, t\}$ represents the left, 
right, bottom, or top boundary, respectively. For each boundary edge x, we use $V_{R_F,x}$ to 
denote the maximal length of the perpendicular line segment from a vertex of 
an edge of F that faces x. Figure 2(a) illustrates these concepts. 





Suppose that P' is either the convex hull or the minimum bounding rectangle 
of a cluster C, and p is a point inside P'. We use λ(p, P') to denote the maximal 
distance between p and a point contained in P'. Clearly, λ(p, P') is the maximal 
distance from p to one of the vertices of P'. Figure 2(b) illustrates this concept 
for the minimum bounding rectangle. 

Below we present two upper bounds respectively for convex hulls, and mini- 
mum bounding rectangles. 

Theorem 3. Suppose that $P_C$ is the convex hull of a cluster C, p is a point 
inside $P_C$, and F is a feature. Then $AP(C, F) \le d(p, F) + \lambda(p, P_C)$. 

Sketch of the Proof: The theorem can be verified according to the definitions 
of AP, d, and λ. □ 
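The bound of Theorem 3 follows from the triangle inequality: for every cluster point q, d(q, F) is at most the distance from q to p plus d(p, F), and the distance from q to p is at most the farthest distance from p to the hull. A minimal sketch, assuming p defaults to the barycenter and the point-to-feature distance is supplied by the caller; because the farthest point of a convex hull from p is always a hull vertex, and hull vertices are cluster points, the sketch scans the cluster points instead of building the hull. The function names are illustrative.

```python
import math

def farthest_distance(p, cluster):
    """lambda(p, P_C): maximal distance from p to the convex hull of C.
    The maximum is attained at a hull vertex, which is a point of C."""
    return max(math.dist(p, q) for q in cluster)

def ap_upper_bound(cluster, feature, dist, p=None):
    """Upper bound of AP(C, F): d(p, F) + lambda(p, P_C)."""
    if p is None:  # default reference point: the barycenter of C
        n = len(cluster)
        p = (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)
    return dist(p, feature) + farthest_distance(p, cluster)

# Usage with a point feature, so that d(p, F) is the Euclidean distance.
cluster = [(0.0, 0.0), (2.0, 0.0)]
print(ap_upper_bound(cluster, (5.0, 0.0), math.dist))
```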

It is clear that the right-hand side of the inequality in Theorem 3 can be used 
as an upper bound of AP. However, the computation of d(p, F) runs in time 
O(|F|). The upper bound presented below in Theorem 4 is based on the minimum 
bounding rectangles and can be computed in constant time, though it is not as 
tight as that in Theorem 3. First we should note that Theorem 3 also holds if 
we replace $P_C$ by $R_C$. 

Theorem 4. Suppose that $R_C$ of C is given, $R_F$ of F is given, and p is a point 
contained in $R_C$. Then 

$$d(p, F) + \lambda(p, R_C) \le \min_{x \in \{l, r, b, t\}} \{\min\{H_{R_F,x}, V_{R_F,x}\} + d(p, x)\} + \lambda(p, R_C). \qquad (5)$$

Here x is a boundary edge of $R_F$. 

Proof: From the definitions of $H_{R_F,x}$ and $V_{R_F,x}$, the theorem can be immediately 
verified. □ 

Theorem 3 together with Theorem 4 implies that the right-hand side of the 
inequality in Theorem 4 is another upper bound of AP. Further, it should be clear that 
this upper bound can be computed in constant time, provided that $H_{R_F,x}$ and 
$V_{R_F,x}$ are obtained and $\lambda(p, R_C)$ is computed for a given p. 




Fig. 3. Upper bounds 



The total lengths of the thick dotted lines in Figure 3(a) and Figure 3(b) 
show the upper bounds in Theorems 3 and 4, respectively. They also show that 
our bounds may be tighter than the longest distance between the convex hulls 
or between the minimum bounding rectangles. However, we cannot generally 
prove this because the tightness of the bounds depends on the choice of p. In 
our algorithm, we choose the centroid of a cluster C in the upper bound 
computation since it has to be used to obtain a lower bound. 

In the next several subsections, we present first the algorithm CPM based 
on the minimum bounding rectangles. 



3.2 Read in Clusters 

In Step 1, we first read the clusters specified in the user's query into the 
buffer by executing the data retrieval method [3]; for instance, regarding the 
query in Example 1, the suburbs with average house price less than $400,000 are 
read into the buffer by querying the SUBURB table. Then, we compute $R_C$, 
b(C), and $\lambda(b(C), R_C)$ for each cluster C. Clearly, this step takes linear time 
with respect to the total size of the clusters. 

3.3 Read In and Filter Out Features 

This subsection presents Step 2 and Step 3. The number of features 
to be processed may be very large, and each feature may have many edges. 
It may be impossible to keep all features in the buffer all the time. Consequently, 
features should be read into the buffer batch by batch. Once a batch of features 
is read in, the features are first assigned group IDs according to the user's specifications; for 
instance, in Example 1 three feature groups are retrieved. 

After a feature is assigned a group ID, the algorithm CPM invokes the filter- 
ing process in Step 3. It is based on the following lemma. 

Lemma 5. Suppose that C is a cluster, and $\pi = \{F_1, F_2, \ldots, F_k\}$ is a group of 
features. For $1 \le j \le k$, let $LB_{AP}(C, F_j)$ and $UB_{AP}(C, F_j)$ be a lower bound 
and an upper bound of $AP(C, F_j)$. Then: 

$$\min_{1 \le j \le k} \{LB_{AP}(C, F_j)\} \le D(C, \pi) \le \min_{1 \le j \le k} \{UB_{AP}(C, F_j)\}.$$

Proof: The lemma immediately follows from the definition of D. □ 

For notational simplicity, we extend S to include the given $C_0$; that is, $S = 
\{C_i : 0 \le i \le n\}$. With respect to each pair of $C_i$ and $\pi_j$, we use $LB^{i,j}$ to record 
the minimum value of the lower bounds of $AP(C_i, F)$ over all $F \in \pi_j$, and use 
$UB^{i,j}$ to record the minimum value of the upper bounds. Lemma 5 says that 
$LB^{i,j}$ and $UB^{i,j}$ are, respectively, a lower and an upper bound of $D(C_i, \pi_j)$. 

In our algorithm, we initially set both $LB^{i,j}$ and $UB^{i,j}$ to $\infty$, and then 
gradually update them when a new feature in $\pi_j$ is processed. We use a dynamic 
array $A_{i,j}$ to store the candidate features in $\pi_j$ for computing $D(C_i, \pi_j)$. Each 
element in $A_{i,j}$ stores the identifier FID of a feature F, the obtained lower 
bound of $AP(C_i, F)$, and the pointer g_des that points to the spatial description 
of the feature. 

Specifically, to process an $F \in \pi_j$, CPM first computes $R_F$ and the values 
of $V_{R_F,x}$ and $H_{R_F,x}$ for $R_F$; this can be done easily by scanning F only once. Secondly, 
for each $C_i$, we check whether F should be included in $A_{i,j}$. According to Lemma 5, we 
add F to $A_{i,j}$ if the obtained lower bound of $AP(C_i, F)$ is less than the current 
value of $UB^{i,j}$. Further, we update $LB^{i,j}$ and $UB^{i,j}$ each time after F is 
added to $A_{i,j}$, and then check $A_{i,j}$ to determine whether some features in $A_{i,j}$ should 
be removed due to an update of $UB^{i,j}$. To prevent unnecessary scans of $A_{i,j}$ each 
time after $UB^{i,j}$ is updated, we also record the maximum value of the lower 
bounds of $AP(C_i, F)$ among the features F in the current $A_{i,j}$; it is recorded by 
$q^{i,j}$ ($q^{i,j}$ is initially zero). 

For example, suppose that the features $F_1$, $F_2$, $F_3$, and $F_4$ are in $\pi_1$. Their 
lower and upper bounds of AP values against a cluster $C_1$ are (4, 5) for $F_1$, 
(6, 7) for $F_2$, (3, 6) for $F_3$, and (3, 3.5) for $F_4$; see Figure 4. Initially, $LB^{1,1} = 
UB^{1,1} = \infty$ and $q^{1,1} = 0$. Suppose that $F_1$ is processed first. We add $F_1$ to 
$A_{1,1}$ by recording its ID and the lower bound; then $LB^{1,1} = 4$, $UB^{1,1} = 5$, 
and $q^{1,1} = 4$. Next, $F_2$ is processed. $F_2$ should not be included in $A_{1,1}$ because 
$6 > UB^{1,1}$. However, when $F_3$ is processed third, $F_3$ should be added to $A_{1,1}$. 
Consequently, $LB^{1,1} = 3$, $UB^{1,1} = 5$, and $q^{1,1} = 4$. While processing $F_4$, we find that 
$F_4$ should be added to $A_{1,1}$. Accordingly, $LB^{1,1} = 3$, $UB^{1,1} = 3.5$, and $q^{1,1} = 4$. 
Then, since $q^{1,1} > UB^{1,1}$, we should check $A_{1,1}$ to delete $F_1$ from $A_{1,1}$. 







Fig. 4. Filter Out Features 



More precisely, to process a feature $F \in \pi_j$, the above process can be 
presented by the following pseudocode: 

compute R_F, V_{R_F,x} and H_{R_F,x}; /* x ∈ {l, r, b, t} */ 
for each cluster C_i do { 
    LB_AP(C_i, F) → a_1 (the lower bound); UB_AP(C_i, F) → a_2 (the upper bound); 
    if a_1 < UB^{i,j} then { 
        add F to A_{i,j}; 
        min{a_1, LB^{i,j}} → LB^{i,j}; 
        min{a_2, UB^{i,j}} → UB^{i,j}; 
        max{q^{i,j}, a_1} → q^{i,j}; 
        if q^{i,j} > UB^{i,j} then { 
            remove each feature F' from A_{i,j} if LB_AP(C_i, F') > UB^{i,j}; 
            re-compute q^{i,j} from the remaining features in A_{i,j}; } 
    } 
} 
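The filtering procedure for a single pair of cluster and feature group can be sketched in Python as follows. The class name `CandidateList` and its attribute names are illustrative assumptions; the lower and upper bounds of AP are taken as inputs rather than computed, and the pointer to the spatial description is omitted for brevity.

```python
import math

class CandidateList:
    """State for one pair (C_i, pi_j): the bounds LB and UB of D(C_i, pi_j),
    the largest stored lower bound q, and the candidate array A."""

    def __init__(self):
        self.lb = math.inf
        self.ub = math.inf
        self.q = 0.0
        self.items = []  # entries are (fid, lower bound of AP(C_i, F))

    def process(self, fid, a1, a2):
        """a1, a2: lower and upper bounds of AP(C_i, F) for a new feature F."""
        if not a1 < self.ub:
            return  # F cannot be the closest feature of its group
        self.items.append((fid, a1))
        self.lb = min(self.lb, a1)
        self.ub = min(self.ub, a2)
        self.q = max(self.q, a1)
        if self.q > self.ub:
            # Some stored candidate now has a lower bound above UB: prune.
            self.items = [(f, lb) for f, lb in self.items if lb <= self.ub]
            self.q = max(lb for _, lb in self.items)

# The worked example from the text: bounds for F1..F4 against one cluster.
state = CandidateList()
for fid, a1, a2 in [("F1", 4, 5), ("F2", 6, 7), ("F3", 3, 6), ("F4", 3, 3.5)]:
    state.process(fid, a1, a2)
print(state.items, state.lb, state.ub)
```

The pruning scan runs only when q exceeds UB, which is the purpose of tracking q.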

Once a batch of features has been processed by the above procedure, we do not 
keep them in the buffer if no space is left for the next batch of features. In this 
situation, we will need to read in again the features not filtered out after Step 
4, so that we can process Step 5; this is why we keep the pointer g_des 
for each feature object. 



3.4 Filter Out Clusters 

This subsection describes Step 4. After the computation in the last subsection, we 
have obtained $LB^{i,j}$ and $UB^{i,j}$ for each pair of $C_i$ and $\pi_j$. By Lemma 5, we have: 

$$LB^{i,j} \le D(C_i, \pi_j) \le UB^{i,j} \qquad (6)$$

Note that, from the definition, $pm(D(C_i, \pi_j), D(C_0, \pi_j)) = 0$ if $D(C_i, \pi_j) \le 
D(C_0, \pi_j)$. This immediately implies that $pm(D(C_i, \pi_j), D(C_0, \pi_j)) = 0$ if one of 
the following two conditions applies: 

1. $UB^{i,j} \le LB^{0,j}$, or 

2. $UB^{i,j} \le D(C_0, \pi_j)$. 

Thus, intuitively, $pm(D(C_i, \pi_j), D(C_0, \pi_j))$ is bounded by $UB^{i,j} - LB^{0,j}$. Further, 
it cannot be smaller than the minimum distance between the two closed intervals 
$[LB^{i,j}, UB^{i,j}]$ and $[LB^{0,j}, UB^{0,j}]$. This intuition can be immediately verified and 
is stated in Lemma 6. Let 

$$\overline{P}^{i,j} = \begin{cases} UB^{i,j} - LB^{0,j} & \text{if } UB^{i,j} > LB^{0,j} \\ 0 & \text{otherwise} \end{cases}$$

and let 

$$\underline{P}^{i,j} = \begin{cases} LB^{i,j} - UB^{0,j} & \text{if } LB^{i,j} > UB^{0,j} \\ 0 & \text{otherwise} \end{cases}$$

Lemma 6. $pm(D(C_i, \pi_j), D(C_0, \pi_j)) \le \overline{P}^{i,j}$ and $pm(D(C_i, \pi_j), D(C_0, \pi_j)) \ge \underline{P}^{i,j}$ 
for each pair of $C_i$ and $\pi_j$. 

Sketch of the Proof: The lemma follows by applying (6) and the above two 
conditions. □ 

Lemma 6 gives us a lower and an upper bound of $PM_{C_0}(C_i, \Pi)$ for each $C_i$: 

$$\sum_{j=1}^{m} w_j \underline{P}^{i,j} \le PM_{C_0}(C_i, \Pi) \le \sum_{j=1}^{m} w_j \overline{P}^{i,j} \qquad (7)$$

In Step 4, we first compute the lower and upper bounds for each cluster $C_i$ 
($i \ne 0$), as given in (7). Secondly, we compute the minimum value τ of the upper 
bounds among the clusters $C_i$ in S with $i \ne 0$. Thirdly, we scan S to filter out 
each cluster $C_i$ whose lower bound is greater than τ. This procedure runs in time O(nm). 
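The cluster filter of Step 4 can be sketched as below. It assumes the bounds of the group distances are held in two nested lists indexed by cluster and feature group, with index 0 reserved for the reference cluster; the per-group upper bound is the positive part of the cluster's upper bound minus the reference's lower bound, and the per-group lower bound is the positive part of the cluster's lower bound minus the reference's upper bound. The function names are illustrative.

```python
def pm_bounds(lb_i, ub_i, lb_0, ub_0, weights):
    """Lower and upper bounds of the goal function for one cluster."""
    lower = sum(w * max(lb - ub0, 0.0)
                for w, lb, ub0 in zip(weights, lb_i, ub_0))
    upper = sum(w * max(ub - lb0, 0.0)
                for w, ub, lb0 in zip(weights, ub_i, lb_0))
    return lower, upper

def filter_clusters(lb, ub, weights):
    """lb[i][j] and ub[i][j] bound the distance from cluster i to group j;
    row 0 belongs to the reference cluster. Returns surviving indices and tau."""
    bounds = {i: pm_bounds(lb[i], ub[i], lb[0], ub[0], weights)
              for i in range(1, len(lb))}
    tau = min(upper for _, upper in bounds.values())
    kept = [i for i, (lower, _) in bounds.items() if lower <= tau]
    return kept, tau
```

Each cluster needs one pass over the m groups, so the whole filter is linear in the number of cluster and group pairs.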

3.5 Precise Computation 

This subsection describes Step 5. After pruning the clusters, the information on the 
remaining features for solving PM is kept in each $A_{i,j}$. We need to read in these 
features again to perform the precise computation. Since two different arrays $A_{i,j}$ may 
keep the same feature as a candidate, we use a hashing method to read in the 
required features efficiently. A hash table H is created on the feature ID (FID). We 
scan the arrays $A_{i,j}$ one by one and execute the following two steps. 

Step 1: For each feature $F \in A_{i,j}$, hash its FID into an H entry and then 
determine whether its spatial description is already in H. 



200 Xuemin Lin, Xiaomei Zhou, and Chengfei Liu 



Step 2: If the spatial description of F is not in H, then read it into the buffer using 
g_des and store it in H. Then, compute $AP(C_i, F)$ for each remaining $C_i$. 

After computing all AP values, we apply the remaining steps of the brute-force 
method to the remaining clusters to solve the problem. Note that before starting 
Step 5, we also check a special case: if the upper bound of $pm(D(C_i, \pi_j), D(C_0, \pi_j))$ 
from Lemma 6 is zero for all remaining clusters and feature groups, 
then CPM does not have to process Step 5 but outputs the remaining 
clusters as the solution. 
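The hashing step can be sketched with a Python dictionary standing in for the hash table H. The layout of the candidate arrays and the callback `read_feature` are assumptions made for illustration; the callback represents whatever routine fetches a spatial description through its pointer.

```python
def load_candidates(candidate_arrays, read_feature):
    """candidate_arrays maps (i, j) to a list of (fid, g_des) entries;
    read_feature(g_des) loads one spatial description from storage.
    Returns a table from FID to description, reading each feature once."""
    table = {}  # plays the role of the hash table H, keyed by FID
    for entries in candidate_arrays.values():
        for fid, g_des in entries:
            if fid not in table:  # already loaded through another array?
                table[fid] = read_feature(g_des)
    return table
```

A feature that is a candidate for several clusters is therefore fetched once and then shared by all precise AP computations that need it.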



3.6 Complexity of CPM 

In this subsection, we analyze the complexity of CPM. Step 1 takes linear time 
with respect to the total size of the clusters; that is, $O(\sum_{i} |C_i|)$. Step 2 and Step 
3 take time O(n × […]). Step 4 takes time O(nm), while Step 5 takes time 
$O(\sum_{C_i} \sum_{F \in A_{i,j}} |C_i||F|)$ for the remaining clusters $C_i$. 

Note that the brute-force algorithm runs in time $O(\sum_{C} \sum_{F} |C||F|)$. It should 
be clear that, in practice, the time complexity of the brute-force method is much 
higher than that of CPM. This is confirmed by our experiment results in Section 
4. 

3.7 Different Modes of CPM 

The above mode of CPM uses only the minimum bounding rectangles; and it is 
denoted by CPM-R. 

An alternative mode to CPM-R is to use a multiple-filtering technique [?]: 

— first the minimum bounding rectangles are used in Steps 3 and 4, and then 

— the convex hulls for features and clusters are adopted to repeat Steps 3 and 
4 before processing Step 5. 

It is denoted by CPM-RH. In CPM-RH, we employ the divide and conquer 
algorithm [23] to compute convex hulls for clusters. To compute a convex hull 
for a feature (simple polygon), we employ the last step in Graham’s scan [?], 
which runs in linear time. We use the upper bound of AP in Theorem 3 to 
implement the procedure in Section 3.3 for convex hulls. 

Another alternative mode to CPM-R is to use only the convex hulls instead 
of the minimum bounding rectangles. We denote this mode by CPM-H. 

In the next section, we report our experiment results regarding the performance 
of the brute-force algorithm, CPM-R, CPM-RH, and CPM-H. 



4 Experiment Results 

The brute-force algorithm and the three different modes of CPM have been 
implemented in C++ on a Pentium 1/200 with 128 MB of main memory, running 
Windows NT 4.0. In our experiments, we evaluated the algorithms for efficiency 
and scalability. Our performance evaluation is focused on Step 3 
onwards, because the methods [?] of reading in clusters and features are not our 
contribution. Therefore, in our experiments we record only the CPU time and 
exclude I/O costs. 

We developed a program to generate a benchmark. In the program, we first 
use the following random parameters to generate rectangles, such that a rectangle 
may intersect with at most one other rectangle: 

— M gives the number of rectangles, and 

— $wid_R$ controls the width of a rectangle R and $h_R$ controls the height of R. 

More specifically, we first generate M rectangles R with a random width $wid_R$ 
and a random height $h_R$, where $1 \le wid_R, h_R \le 1000$. The generated rectangles 
are randomly distributed in the 2-dimensional space, and each intersects with at most 
one other rectangle. To generate n clusters, we randomly divide the whole region 
into n disjoint sub-regions and choose one rectangle from each region. We use a 
random parameter NC to control the average number of generated points in each 
rectangle among the chosen n rectangles. These give n clusters. The remaining 
M − n rectangles correspond to features. We use another random parameter 
NF to control the average number of the vertices (points) generated in each 
remaining rectangle. 

Note that if R intersects with R', we actually generate the points respectively 
in R and R' − R if R' is not included in R. Further, in each rectangle R 
corresponding to a feature, we apply a Graham’s scan-like algorithm to produce 
a simple polygon connecting each vertex in R; the generated simple polygon is 
used as a feature. Therefore, in our benchmark we have two kinds of spatial 
objects: clusters and features. 

In the experiments below, we adopt a common set of parameters: 1) a feature 
has 150 edges on average, 2) a cluster has 300 points on average, and 3) the 
features are grouped into 20 groups. 

In our first experiment, we generate a database with 10 clusters and 1000 
features. The experiment results are depicted in Figure 5, where the algorithms 
CPM-R, CPM-H, and CPM-RH are respectively abbreviated to “R”, “H”, and 
“R-H”. 

From the first experiment, we can conclude that the brute-force algorithm is 
very slow in practice. Another observation is that H is slower than R and R-H because, 
in H, the computation of lower and upper bounds for each pair of cluster 
and feature does not take constant time. Intuitively, H should be significantly 
slower than R and R-H when the number of clusters and features increases; this 
has been confirmed by the second experiment. 

The second experiment has been undertaken along two “dimensions”: 

— Fix the number of features at 1000, while the number of clusters varies 
from 5 to 20. The results are depicted in Figure 6. 

— Fix the number of clusters at 200, while the number of features varies 
from 1000 to 10000. The results are depicted in Figure 7. 

Fig. 5. A Comparison among four algorithms 

Fig. 6. A Comparison among H, R, and R-H 



Note that in the second experiment, we run R, H, and R-H 40 times against 
each database. This experiment, together with the first experiment, also 
demonstrates that R-H is faster than H on average. 

In the third experiment, to test the scalability, we vary the database size from 
1000 features to 50000 features but fix the number of clusters at 200. 
We run both R and R-H 40 times against each database. Figure 
0 illustrates our experiment results. 

The three conducted experiments suggest that our algorithm is efficient and 
scalable. Secondly, we can see that although an application of convex hulls to our 
filtering procedures is more accurate than an application of minimum bounding 
rectangles, it is too expensive to use directly. The best use of convex hulls should follow 
an application of minimum bounding rectangles. 

Fig. 7. Another Comparison among H, R, and R-H 

5 Discussion 

The problem PM and the algorithm CPM may be either generalized or con- 
strained according to various applications. In this section, we present a discussion 
on these issues. 

A slight modification of the algorithm CPM can be applied to the case where 
we specify the proximity matching conditions directly using the shortest distances 
instead of specifying a given cluster. For instance, in Example 1 we may directly 
specify that such suburbs are within 5 km of a private school, 6 km 
of a beach, and 2 km of a golf course, instead of comparing to 
the suburb Randwick. 

Another modification of the problem is to define $pm(D(C, \pi_j), D(C_0, \pi_j))$ as 
$|D(C, \pi_j) - D(C_0, \pi_j)|$. It can be shown [?] that our algorithms can be 
immediately modified accordingly to resolve this. 

Our results and discussions, so far, are limited to the Euclidean distance. 
Note that the upper bounds presented in Section 3.1 are based on the triangle 
inequality of the Euclidean distance; that is, $d(p_1, p_2) \le d(p_1, p_3) + d(p_3, p_2)$. 
Since the triangle inequality is part of the definition of any metric distance, 
the upper bounds can be applied to any metric space. We should also note that 
the lower bound presented in Section 3.1 can be obtained in any metric space 
whose metric distance γ satisfies the two constraints below: 

— $\gamma(p_1, p_2) + \gamma(p_3, p_4) \ge \gamma(p_1 + p_3, p_2 + p_4)$, and 

— $\gamma(c \times p_1, c \times p_2) = |c| \gamma(p_1, p_2)$ for any constant c. 

Consequently, we can extend the problem PM and the algorithm CPM to any 
metric space where the above two constraints are satisfied. For instance, we can 
verify that the Manhattan distance satisfies the two constraints; thus 
our algorithm can be extended to the 2-dimensional Manhattan distance space. 
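The verification for the Manhattan distance is short enough to spell out. The derivation below is a sketch that writes γ for the Manhattan distance and assumes each point is given by coordinates $p_k = (x_k, y_k)$; the first constraint uses only the triangle inequality for absolute values, applied once per coordinate.

```latex
\begin{align*}
\gamma(p_1, p_2) + \gamma(p_3, p_4)
  &= |x_1 - x_2| + |y_1 - y_2| + |x_3 - x_4| + |y_3 - y_4| \\
  &\ge |(x_1 + x_3) - (x_2 + x_4)| + |(y_1 + y_3) - (y_2 + y_4)| \\
  &= \gamma(p_1 + p_3,\; p_2 + p_4), \\[4pt]
\gamma(c\,p_1, c\,p_2)
  &= |c x_1 - c x_2| + |c y_1 - c y_2|
   = |c|\,\bigl(|x_1 - x_2| + |y_1 - y_2|\bigr)
   = |c|\,\gamma(p_1, p_2).
\end{align*}
```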

6 Conclusions 

In this paper, we formalized a new problem (PM) in spatial data mining from real 
applications. We presented an efficient algorithm based on several novel pruning 
conditions, as well as various different modes of the algorithm. Our experiment 
results showed that the algorithm is very efficient and can support a number of 
real applications where data with huge volume are present. 

Further, in Section 5, we showed that our work in this paper can be extended 
to many other metric spaces. 

Note that the PM problem and the algorithm CPM are restricted to the 
case where a cluster is outside a feature. This restriction may not be 
appropriate for some applications; we are now identifying such applications. 
Besides, we are currently working on the development of indexing techniques to 
support CPM. Further, a modification of PM to cover the applications where 
the distance from a cluster to a set of features is not necessarily restricted to 
one feature in the set seems more complicated; this is another topic for our future study. 

Abstract. Classification is one of the basic tasks of data mining in modern 
database applications including molecular biology, astronomy, mechanical 
engineering, medical imaging or meteorology. The underlying models have to 
consider spatial properties such as shape or extension as well as thematic 
attributes. We introduce 3D shape histograms as an intuitive and powerful 
similarity model for 3D objects. Particular flexibility is provided by using 
quadratic form distance functions in order to account for errors of measurement, 
sampling, and numerical rounding that all may result in small displacements and 
rotations of shapes. For query processing, a general filter-refinement architecture 
is employed that efficiently supports similarity search based on quadratic forms. 
An experimental evaluation in the context of molecular biology demonstrates 
both, the high classification accuracy of more than 90% and the good performance 
of the approach. 



Keywords. 3D Shape Similarity Search, Quadratic Form Distance Functions, 
Spatial Data Mining, Nearest Neighbor Classification 



1 Introduction 

Along with clustering, mining association rules, characterization and generalization, 
classification is one of the fundamental tasks in data mining [CHY 96]. Given a set of 
classes and a query object, the problem is to assign an appropriate class to the query 
object based on its attribute values. Many modern database applications including mo- 
lecular biology, astronomy, mechanical engineering, medical imaging, meteorology and 
others are faced with this problem. When new objects are discovered through remote 
sensing, new tumors are detected from X-ray images, or new molecular 3D structures 
are determined by crystallography or NMR techniques, an important question is to 
which class the new object belongs. Further steps to deeper investigations may be guid- 
ed by the class information: a prediction of primary and secondary effects of drugs could 
be tried, the multitude of mechanical parts could be reduced, etc. 

As a basis for any classification technique, an appropriate model has to be provided. 
Classes represent collections of objects that have characteristic properties in common 
and thus are similar, whereas different classes contain objects that have more or less 
strong dissimilarities. In all of the mentioned applications, the geometric shape of the 



R.H. Guting, D. Papadias, F. Lochovsky (Eds.): SSD’99, LNCS 1651, pp. 207-226, 1999. 
© Springer-Verlag Berlin Heidelberg 1999 




208 Mihael Ankerst et al. 



objects is an important similarity criterion. Along with the geometry, also thematic at- 
tributes such as physical and chemical properties have an influence on the similarity of 
objects. 

Data from real world applications inherently suffer from errors, beginning with er- 
rors of measurement, calibration, sampling errors, numerical rounding errors, displace- 
ments of reference frames, and small shifts as well as rotations of the entire object or 
even of local details of the shapes. Though no full invariance against rotations is gener- 
ally required, if the objects are already provided in a standardized orientation, these 
errors have to be taken into account. In this paper, we introduce a flexible similarity 
model that considers these problems of local inaccuracies and may be adapted by the 
users to their specific requirements or individual preferences. 

The paper is organized as follows: The remainder of this introduction surveys related 
work from molecular biology, data mining, and similarity search in spatial databases. In 
Section 2, we introduce the components of our similarity model: 3D shape histograms 
for object representation, and a flexible similarity distance function. Due to the large and 
rapidly increasing size of current databases, the performance of query processing is an 
important task and, therefore, we introduce an efficient multistep system architecture in 
Section 3. In Section 4, we present the experimental results concerning the effectiveness 
and efficiency of our technique in the context of molecular biology. Section 5 concludes 
the paper. 

1.1 Classification in Molecular Databases 

A major issue in biomolecular databases is to get a survey of the objects, and thus a basic 
task is classification: To which of the recognized classes in the database does a new 
molecule belong? In molecular biology, there are already classification schemata avail- 
able. In many systems, classifying new objects when inserting them into the database 
requires supervision by experts that are very experienced and have a deep knowledge of 
the domain of molecular biology. What is desired is an efficient classification algorithm 
that may act as a fast filter for further investigation and that may be restricted e.g. to 
geometric aspects. 

A sophisticated classification is available from the FSSP database (Families of 
Structurally Similar Proteins), generated by the Dali system [HS 94] [HS 98]. The sim- 
ilarity of two proteins is based on their secondary structure, that is substructures of the 
molecules such as alpha helices or beta sheets. The evaluation of a pair of proteins is 
very expensive, and query processing for a single molecule against the entire database 
currently takes an overnight run on a workstation. 

Another classification schema is provided by CATH [OMJ+ 97], a hierarchical classification
of protein domain structures, which clusters proteins at four major levels: class
(C), architecture (A), topology (T) and homologous superfamily (H). The class label is 
derived from secondary structure content and cannot be assigned for all protein struc- 
tures automatically. The architecture label, which describes the gross orientation of sec- 
ondary structures, independent of connectivities, is currently assigned manually. The 
assignments of structures to topology families and homologous superfamilies are made 
by sequence and structure comparisons. 
1.2 Nearest-Neighbor Classification

A lot of research has been performed in the area of classification algorithms; surveys are 
presented in [WK 91] [MST 94] [Mit 97]. All the methods require that a training set of 
objects is given for which both the attribute values and the correct classes are known a- 
priori. Based on this knowledge of previously classified objects, a classifier predicts the 
unknown class of a new object. The quality of a classifier is typically measured by the 
classification accuracy, i.e. by the percentage of objects for which the class label is 
correctly predicted. 

Many methods of classification generate a description for the members of each class, 
for example by using bounding boxes, and assign a class to an object if the object match- 
es the description of the class. Nearest neighbor classifiers, on the other hand, refrain 
from discovering a possibly complex description of the classes. As their name indicates, 
they retrieve the nearest neighbor p of a query object q and return the class label of p in 
order to predict the class label of q. Obviously, the definition of an appropriate distance 
function is crucial for the effectiveness of nearest neighbor classification. In a more 
general form, called k-nearest neighbor classification, k nearest neighbors of the query 
object q are used to determine the class of q. Thus, the effectiveness depends on the 
number k as well as on the weighting of the k neighbors. Successful nearest neighbor
classification thus requires both appropriate similarity models and efficient algorithms
for similarity search.
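
The k-nearest neighbor scheme described above can be sketched in a few lines. This is only a minimal illustration, assuming an unweighted majority vote and the Euclidean distance; the function names and the layout of the training set are hypothetical and not taken from the original system.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance of two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(query, training, k=1, dist=euclidean):
    """Predict the label of `query` by an unweighted vote of its k nearest neighbors.

    training: list of (feature_vector, class_label) pairs (hypothetical layout).
    """
    neighbors = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    # Ties are resolved in favor of the label seen first, i.e. the closer neighbor.
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to plain nearest neighbor classification; any other distance function can be passed via `dist`.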

1.3 Geometry-Based Similarity Search 

Considerable work on shape similarity search in spatial database systems has been per- 
formed in recent years. As a common technique, the spatial objects are transformed into 
high-dimensional feature vectors, and similarity is measured in terms of vicinity in the 
feature space. The points in the feature space are managed by a multi-dimensional index. 
Many of the approaches only deal with two-dimensional objects such as digital images 
or polygonal data and do not support 3D shapes. 

Let us first survey previous 2D approaches from the literature. In [GM 93], a shape 
is represented by an ordered set of surface points, and fixed-sized subsets of this repre- 
sentation are extracted as shape features. This approach supports invariance with respect 
to translation, rotation and scaling, and is able to deal with partially occluded objects. 
The technique of [BKK 97] applies the Fourier transform in order to encode sections of 
polygonal outlines of 2D objects; even partial similarity is supported. Both methods 
exploit a linearization of polygon boundaries and, therefore, are hard to extend to 3D 
objects. In [Jag 91], shapes are approximated by rectangular coverings. The rectangles 
of a single object are sorted by size, and the largest ones are used for the similarity 
retrieval. The method of [KSF+ 96] is based on mathematical morphology and uses the
max morphological distance and max granulometric distance of shapes. It has been ap- 
plied to 2D tumor shapes in medical image databases. A 2D technique that is related to 
our 3D shape histograms is the Section Coding technique [Ber 97] [BK 97] [BKK 97a]. 
For each polygon, the circumscribing circle is decomposed into a given number of sec- 
tors, and for each sector, the area of the polygon inside of this sector divided by the total 
area of the polygon is determined. Similarity is defined in terms of the Euclidean distance
of the resulting feature vectors. The similarity model in [AKS 98] handles 2D
shapes in pixel images and provides a solution for the problem of small displacements. 

The QBIC (Querying By Image Content) system [FBF+ 94] [HSE+ 95] contains a 
component for 2D shape retrieval where shapes are given as sets of points. The method 
is based on algebraic moment invariants and is also applicable to 3D objects [TC 91]. As 
an important advantage, the invariance of the feature vectors with respect to rigid transformations
(translations and rotations) is inherently given. However, the adjustability of
the method to specific applications is restricted. From the available moment invariants, 
appropriate ones have to be selected, and their weighting factors may be modified. 
Whereas the moment invariants are abstract quantities, the shape histograms presented 
in this paper are more intuitive and may be graphically visualized, thus providing an 
impression of the suitability for specific applications. 

The Geometric Hashing paradigm for model-based 3D object recognition was intro- 
duced by [LW 88]. The objects are represented by sets of points; from these points, non- 
collinear triplets are selected to represent different orientations of a single object. For 
each of these orientations, every point of an object is stored in a hash table that maps 3D 
points to objects and their orientations. The query processing heuristic requires a certain 
threshold provided by the user. This threshold has a substantial impact on the effective- 
ness of the technique and, thus, an appropriate choice is crucial. If the threshold is too 
high, no answer is reported; if the threshold is too low, however, there is no guarantee 
and, moreover, no feedback whether the best matching object with respect to the under- 
lying similarity model is returned. In contrast to that, the k-nearest neighbor algorithm 
used in our approach ensures that the k most similar objects are returned. There are no 
objects in the database which are more similar than the retrieved ones. 

The approximation-based similarity model presented in [KSS 97] and [KS 98] ad- 
dresses the retrieval of similar 3D surface segments. These surface segments occur in the 
context of molecular docking prediction where they represent potential docking sites. 
Since the segments are designed to model local portions of 3D surfaces but not to model 
the entire contour of a 3D solid, this technique is not applicable for searching 3D solids 
having similar global shapes. 

1.4 Invariance Properties of Similarity Models 

All the mentioned similarity models incorporate invariance against translation of the 
objects, some of them also include invariance against scaling which is not necessarily 
desired in the context of molecular or CAD databases. With respect to invariance against 
rotations, two approaches can be observed. Some of the similarity models inherently 
support rotational invariance, e.g. by means of the Fourier transform [BKK 97] or the 
algebraic moment invariants [TC 91]. Most of the techniques, however, include a pre- 
processing step that rotates the objects to a normalized orientation, e.g. by a Principal 
Axis Transform. If rotations should be considered nevertheless, the objects may be ro- 
tated artificially by certain angles as suggested in [Ber 97]. For some applications, even- 
tually, rotational invariance may be not required, e.g. if mechanical parts in a CAD 
database are already stored in a standardized orientation. 

An important kind of invariance has not yet been considered in previous work: the
robustness of similarity models against errors of measurement, calibration, sampling
errors, errors of classification of object components, numerical rounding errors, and
small displacements such as shifts or slight rotations of geometric details. In our model, 
these problems are addressed and may be controlled by the user by specifying and adapt- 
ing a similarity matrix for histogram bins. 

2 A 3D Shape Similarity Model 

In this section, we introduce our 3D shape similarity model by defining the two major 
ingredients: First, the shape histograms as an intuitive and discrete representation of 
complex spatial objects. Second, an adaptable similarity distance function for the shape 
histograms that may take small shifts and rotations into account by using quadratic 
forms. 

2.1 Shape Histograms 

The definition of an appropriate distance function is crucial for the effectiveness of any 
nearest neighbor classifier. A common approach for similarity models is based on the 
paradigm of feature vectors. A feature transform maps a complex object onto a feature 
vector in a multidimensional space. The similarity of two objects is then defined as the 
vicinity of their feature vectors in the feature space. 

We follow this approach by introducing 3D shape histograms as intuitive feature 
vectors. In general, histograms are based on a partitioning of the space in which the 
objects reside, i.e. a complete and disjoint decomposition into cells which correspond to 
the bins of the histograms. The space may be geometric (2D, 3D), thematic (e.g. physi- 
cal or chemical properties), or temporal (modeling the behavior of objects). 

We suggest three techniques for decomposing the space: A shell model, a sector 
model, and a spiderweb model as the combination of the former two (cf. Figure 1). In a 
preprocessing step, a 3D solid is moved to the origin. Thus the models are aligned to the 
center of mass of the solid. 






4 shell bins 12 sector bins 48 combined bins 

Figure 1. Shells and sectors as basic space decompositions for shape histograms. In 
each of the 2D examples, a single bin is marked 



Shell Model. The 3D space is decomposed into concentric shells around the center point. This
representation is independent of rotations of the objects, i.e. any rotation
of an object around the center point of the model results in the same histogram. The radii
of the shells are determined from the extensions of the objects in the database. The
outermost shell is left unbounded in order to cover objects that exceed the size of the
largest known object.
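
As an illustration of the shell model, the following minimal sketch counts surface points per shell. It assumes that the points are already centered on the origin and that the shell radii are given as an increasing list; both the function name and this parameterization are assumptions made for the example.

```python
import math
from bisect import bisect_left


def shell_histogram(points, radii):
    """Count points per concentric shell around the origin.

    points: iterable of (x, y, z), assumed to be centered on the origin.
    radii:  increasing outer radii of the bounded shells; one additional,
            unbounded shell collects every point beyond radii[-1].
    """
    hist = [0] * (len(radii) + 1)
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        # bisect_left returns the index of the first shell whose radius is >= r.
        hist[bisect_left(radii, r)] += 1
    return hist
```

Since only the distance to the center enters the computation, rotating the point set around the origin leaves the histogram unchanged.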
Sector Model. The 3D space is decomposed into sectors that emerge from the center point of
the model. This approach is closely related to the 2D section coding method [BKK 97a].
However, the definition and computation of 3D sector histograms is more sophisticated,
and we define the sectors as follows: Distribute the desired number of points uniformly
on the surface of a sphere. For this purpose, we use the vertices of regular polyhedra
and their recursive refinements. Once the points are distributed, the Voronoi diagram of
the points immediately defines an appropriate decomposition of the space. Since the
points are regularly distributed on the sphere, the Voronoi cells meet at the center point
of the model. For the computation of sector-based shape histograms, we need not
materialize the complex Voronoi diagram but simply apply a nearest neighbor search in
3D, since the typical number of sectors is not very large.
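
The nearest neighbor assignment of points to sectors can be sketched as follows. As an assumption for this example, the six vertices of an octahedron serve as sector centers; the paper's recursive refinements of regular polyhedra are not reproduced. For unit-length centers, the nearest center of a point is the one with the largest dot product.

```python
# Unit vectors to the six vertices of an octahedron (assumed sector centers).
OCTAHEDRON = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def sector_histogram(points, centers=OCTAHEDRON):
    """Count points per sector; each sector is the Voronoi cell of one center.

    points:  iterable of (x, y, z), assumed to be centered on the origin.
    centers: unit vectors; the nearest one maximizes the dot product.
    """
    hist = [0] * len(centers)
    for p in points:
        best = max(
            range(len(centers)),
            key=lambda i: sum(a * b for a, b in zip(p, centers[i])),
        )
        hist[best] += 1
    return hist
```

Scaling all points by a positive factor does not change which center is nearest, which reflects the scaling invariance of the sector model.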

Combined Model. The combined model represents more detailed information than 
pure shell models and pure sector models. A simple combination of two fine-grained 3D 
decompositions results in a high dimensionality. However, since the resolution of the 
space decomposition is a parameter in any case, the number of dimensions may easily 
be adapted to the particular application. 
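
A combined histogram can be sketched by pairing the shell index and the sector index of each point. The bin numbering used here (shell index times number of sectors plus sector index) is an assumption chosen for the example, as are the function and parameter names.

```python
import math
from bisect import bisect_left


def combined_histogram(points, radii, centers):
    """Histogram over all (shell, sector) pairs.

    radii:   increasing shell radii; the outermost shell is unbounded.
    centers: unit vectors defining the sectors.
    """
    n_sectors = len(centers)
    hist = [0] * ((len(radii) + 1) * n_sectors)
    for p in points:
        shell = bisect_left(radii, math.sqrt(sum(c * c for c in p)))
        sector = max(
            range(n_sectors),
            key=lambda i: sum(a * b for a, b in zip(p, centers[i])),
        )
        hist[shell * n_sectors + sector] += 1
    return hist
```

The dimension of the result is the product of the number of shells and the number of sectors, which is why fine-grained combinations quickly become high-dimensional.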

In Figure 2, we illustrate various shape histograms for the example protein, 1SER-B,
which is depicted on the left of the figure. In the middle, the various space decompositions
are indicated schematically, and on the right, the corresponding shape histograms
are depicted. The top histogram is purely based on shell bins, and the bottom histogram
is defined by 122 sector bins. The histograms in the middle follow the combined model;
they are defined by 20 shell bins and 6 sector bins, and by 6 shell bins and 20 sector bins,
respectively. In this example, all the different histograms have approximately the same
dimension of around 120. Note that the histograms are not built from volume elements
but from uniformly distributed surface points taken from the molecular surfaces.

2.2 Shortcomings of the Euclidean Distance 

In order to quantify the dissimilarity of objects, an appropriate distance function of 
feature vectors has to be provided. An obvious solution is to employ the classic Euclid- 
ean distance function which is well-defined for feature spaces of arbitrary dimension. In 
a squared representation, the Euclidean distance of two N-dimensional vectors p and q
is defined as:

$$d_{Euclid}^2(p, q) = \sum_{i=1}^{N} (p_i - q_i)^2 = (p - q) \cdot (p - q)^T$$

However, the Euclidean distance exhibits severe limitations with respect to similari- 
ty measurement. In particular, the individual components of the feature vectors which 
correspond to the dimensions of the feature space are assumed to be independent from 
each other, and no relationships of the components such as substitutability and compens- 
ability may be regarded. The following example demonstrates these shortcomings in 
more detail. 

Let us consider the three objects a, b, and c from Figure 3. From a visual inspection,
we assess the objects a and b to be more similar than a and c or b and c since the two
characteristic peaks are located closer together in the objects a and b than in object
c. However, the peaks of a and b do not overlap the same sectors and, therefore, are
mapped to distinct histogram bins. The Euclidean distance neglects any relationship of
the vector components and does not reflect the close similarity of a and b in comparison
to c. Thus, the three objects are considered equally similar, because their feature vectors
have the same pairwise distances.

Figure 2. Several 3D shape histograms of the example protein 1SER-B. From top to
bottom, the number of shells decreases and the number of sectors increases
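
The effect can be reproduced with a toy example. The three six-bin histograms below are hypothetical and much simpler than the shapes in Figure 3: a and b have their single peak in adjacent bins, c has it far away, and yet all pairwise Euclidean distances coincide.

```python
import math


def euclidean(p, q):
    """Euclidean distance of two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# Hypothetical sector histograms with one peak each.
a = [1, 0, 0, 0, 0, 0]  # peak in bin 0
b = [0, 1, 0, 0, 0, 0]  # peak in the neighboring bin 1
c = [0, 0, 0, 0, 1, 0]  # peak in the distant bin 4

distances = (euclidean(a, b), euclidean(a, c), euclidean(b, c))
```

All three entries of `distances` are the square root of 2, so the neighborhood of bins 0 and 1 is invisible to the Euclidean distance.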

2.3 Quadratic Form Distance Functions 

An approach to overcome these limitations has been investigated for color histograms in 
the QBIC project (Query by Image Content) at IBM Almaden [FBF+ 94] [HSE+ 95].
The authors suggest using quadratic form distance functions, which have also been successfully
applied to several multimedia database applications [Sei 97] [SK 97]
[KSS 97] [AKS 98] [KS 98]. A quadratic form distance function is defined in terms of
a similarity matrix A as follows, where the components $a_{ij}$ of the matrix A represent the
similarity of the components i and j in the underlying vector space:

$$d_A^2(x, y) = (x - y) \cdot A \cdot (x - y)^T = \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij} (x_i - y_i)(x_j - y_j)$$

From this definition, it becomes clear that the standard Euclidean distance is a special
case of the quadratic form distance, which is obtained by using the identity matrix Id
as similarity matrix. Analogously, we obtain a weighted Euclidean distance function
that has the weights $(w_1, w_2, \ldots, w_N)$ by using the diagonal matrix
$\mathrm{diag}(w_1, w_2, \ldots, w_N)$ as
similarity matrix. In both cases, the non-diagonal components are set to zero, which
exactly corresponds to the fact that no cross-similarities of the dimensions are assumed.

Figure 3. Shortcomings of the Euclidean distance. The Euclidean distance of the shape
histograms does not reflect the similarity that is due to the proximity of neighboring sectors

The Euclidean distance of two vectors p and q is totally determined; there is no
parameter that may be tuned. The weighted Euclidean distance is a little more flexible
because it controls the effect of each vector component on the overall distance by
specifying individual weights for the dimensions. A new level of flexibility is supported
by the general quadratic form distance function: on top of specifying the effect of individual
dimensions on the overall distance, cross-dependencies of the dimensions may
be handled.
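
The three levels of flexibility can be illustrated with a small sketch of the squared quadratic form distance. The matrix `NEIGHBOR` is a hypothetical similarity matrix in which adjacent bins are considered half similar; it is not taken from the experiments.

```python
def quadratic_form_distance_sq(x, y, A):
    """Squared quadratic form distance (x - y) A (x - y)^T."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    n = len(diff)
    return sum(A[i][j] * diff[i] * diff[j] for i in range(n) for j in range(n))


# The identity matrix yields the squared Euclidean distance.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Hypothetical similarity matrix: adjacent bins are considered half similar.
NEIGHBOR = [[1, 0.5, 0], [0.5, 1, 0.5], [0, 0.5, 1]]

a, b, c = [1, 0, 0], [0, 1, 0], [0, 0, 1]
```

With `IDENTITY`, the pairs (a, b) and (a, c) both have squared distance 2. With `NEIGHBOR`, the pair (a, b) drops to 1 while (a, c) stays at 2, so peaks in neighboring bins are now recognized as more similar.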

By using a quadratic form distance function as an adaptable similarity function, the 
problems of the Euclidean distance may be overcome. The neighborhood of bins in 
general and of shells or sectors in particular may be represented as similarity weights in 
the similarity matrix A. The individual similarity weights depend on the distances of the 
corresponding bins. Let us denote by d(i,j) the distance of the cells that correspond to 
the bins i and j. For shells, we define the bin distance to be the difference of the corre- 
sponding shell radii, and for sectors, we use the angle between the sector centers as bin 
distances. When provided with an appropriate bin distance function, we compute the 
corresponding similarity weights by an adapted formula from [HSE+ 95] as follows:



The parameter $\sigma$ controls the global shape of the similarity matrix. The higher $\sigma$, the
more similar the resulting matrix is to the identity matrix. In any case, a high value of $\sigma$
makes the matrix diagonally dominant. We observed good results for $\sigma$ between 1.0
and 10.
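
To make the role of the parameter concrete, the following sketch builds a similarity matrix from a bin distance function. The exponential form $a_{ij} = e^{-\sigma \cdot d(i,j)}$ used here is an assumption: it is one form that matches the behavior described above (larger values of the parameter push the matrix toward the identity), but it should not be read as the exact formula of the original system.

```python
import math


def similarity_matrix(bin_distance, n_bins, sigma):
    """Similarity weights a_ij = exp(-sigma * d(i, j)) (assumed form).

    bin_distance: function (i, j) -> non-negative distance of bins i and j.
    """
    return [
        [math.exp(-sigma * bin_distance(i, j)) for j in range(n_bins)]
        for i in range(n_bins)
    ]


# Example: three shells whose bin distance is the difference of their radii.
radii = [1.0, 2.0, 3.0]
shell_distance = lambda i, j: abs(radii[i] - radii[j])

A_soft = similarity_matrix(shell_distance, 3, sigma=1.0)
A_sharp = similarity_matrix(shell_distance, 3, sigma=10.0)
```

In `A_sharp`, all off-diagonal entries are below 0.0001, so the matrix is close to the identity; in `A_soft`, neighboring shells keep a weight of about 0.37.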
2.4 Invariance Properties of the Models

In general, the 3D objects are located anywhere in 3D space, and their orientation as well
as their size can vary arbitrarily. For defining meaningful and applicable similarity models,
we have to provide invariance against translation, scaling, and rotation, depending on
the application. We can ensure these invariances in three ways: by a preprocessing
normalization step, by the similarity model itself, or by both.

In a normalization step, we perform translation and rotation of all objects. After the 
translation which maps the center of mass of each object onto the origin, we perform a 
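Principal Axes Transform on the object. The computation for a set of 3D points starts
with the 3×3 covariance matrix, where the entries are determined by an iteration over
the coordinates (x, y, z) of all points:

$$A = \begin{pmatrix} \sum x^2 & \sum xy & \sum xz \\ \sum xy & \sum y^2 & \sum yz \\ \sum xz & \sum yz & \sum z^2 \end{pmatrix}$$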

The eigenvectors of this covariance matrix represent the principal axes of the origi- 
nal 3D point set, and the eigenvalues indicate the variance of the points in the respective 
direction. As a result of the Principal Axes Transform, all the covariances of the trans- 
formed coordinates vanish. Although this method in general leads to a unique orienta- 
tion of the objects, this does not hold for the exceptional case of an object with at least 
two variances having the same value. In our experiments using the protein database, we 
almost never observed such cases and, therefore, assume a unique orientation of the 
objects. 
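
The first part of the normalization step can be sketched as follows. As assumptions for this example, the center of mass is taken to be the unweighted mean of the points, and the covariance entries are plain sums without normalization by the number of points (a constant factor does not change the eigenvectors). The eigen decomposition itself is omitted; a symmetric eigensolver such as `numpy.linalg.eigh` could be applied to the result.

```python
def center(points):
    """Translate the points so that their unweighted mean lies in the origin."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - mean[k] for k in range(3)) for p in points]


def covariance_matrix(points):
    """3x3 matrix of coordinate product sums of the centered points."""
    pts = center(points)
    return [[sum(p[r] * p[c] for p in pts) for c in range(3)] for r in range(3)]
```

For points that are already aligned with the coordinate axes, the off-diagonal entries vanish, which is the state that the Principal Axes Transform establishes for arbitrary point sets.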

The similarity models themselves have inherent invariance properties. Obviously,
the sector model is invariant against scaling, whereas the shell model trivially has rotational
invariance. Often, no full invariance is desired; instead, just small displacements,
shifts, or rotations of geometric details occur in the data, for example caused by errors of
measurement, sampling, or numerical rounding errors. This varying degree of invariance,
which is highly application- and user-dependent, is supported by the user-defined
similarity matrix modeling the appropriate similarity weight for each pair of bins.

2.5 Extensibility of Histogram Models 

What we have discussed so far is a very flexible and intuitive similarity model for 3D 
objects. However, the distance function of the similarity model is based just on the spa- 
tial attributes of the objects. Frequently, on top of the geometric information, a lot of 
thematic information is used to describe spatial objects. Particularly in protein databas- 
es, the chemical structure and physical properties are important. Examples include atom 
types, residue types, partial charge, hydrophobicity, electrostatic potential among oth- 
ers. A general approach to manage thematic information along with spatial properties is 
provided by combined histograms. Figure 4 demonstrates the basic principle. Assume
we are given a spatial histogram structure as presented above and, additionally, a thematic
histogram structure. A combined histogram structure is immediately obtained
as the Cartesian product of the original structures.
Figure 4. Example for a combined thematic and shape histogram for a molecule



Obviously, this product-based approach leads to a tradeoff between more powerful
modeling and a very high dimensionality. An investigation of the efficiency and effectiveness
as well as the development of new techniques that meet the requirements of
ultra-high-dimensional spaces is part of our future research plans.
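
The Cartesian product construction and its effect on the dimensionality can be sketched briefly. The input layout (one pair of spatial and thematic bin indices per surface point) and the bin numbering are assumptions made for this example.

```python
def product_histogram(items, n_spatial, n_thematic):
    """Histogram over the Cartesian product of spatial and thematic bins.

    items: iterable of (spatial_bin, thematic_bin) index pairs, e.g. one
           pair per surface point (hypothetical input layout).
    """
    hist = [0] * (n_spatial * n_thematic)
    for s, t in items:
        hist[s * n_thematic + t] += 1
    return hist
```

For example, 120 spatial bins combined with 20 thematic bins (say, one per residue type) already yield 2,400 dimensions.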

3 Efficient Query Processing 

Due to the enormous and still increasing size of modern databases that contain tens and 
hundreds of thousands of molecules, mechanical parts, or medical images, the task of 
efficient query processing becomes more and more important. In the case of quadratic
form distance functions, the evaluation time for a single database object increases quadratically
with the dimension. We measured 0.23 milliseconds on average for 21D histograms,
6.2 milliseconds for 256D and 1,656 milliseconds in 4,096D space (cf. Figure 5).
Thus, linearly scanning the overall database is prohibitive. In order to achieve a good
performance, our system architecture follows the paradigm of multistep query process- 
ing: An index-based filter step produces a set of candidates, and a subsequent refinement 
step performs the expensive exact evaluation of the candidates [Sei 97] [AKS 98]. 

3.1 Optimal Multistep k-Nearest Neighbor Search

Whereas the refinement step in a multistep query processor has to ensure the correct- 
ness, i.e. no false hits may be reported as final answers, the filter step is primarily respon- 
sible for the completeness, i.e. no actual result may be missing from the final answers 
and, therefore, from the set of candidates. Figure 6 illustrates the architecture of our 
multistep similarity query processor that fulfills this property [SK 98]. Moreover, as an 
advantage over the related method of [KSF+ 96], our algorithm is proven to be optimal,
i.e. it produces only the minimum number of candidates. Thus, expensive evaluations of 
unnecessary candidates are avoided, and we observed improvement factors of up to 120 
for the number of candidates and 48 for the overall runtime. 

Figure 5. Average evaluation time for single ellipsoid distances (axis label: dimension of ellipsoid)

Based on a multidimensional index structure, the filter step performs an incremental 
ranking that reports the objects ordered by their increasing filter distance to the query 
object using an algorithm derived from [HS 95]. The number of accessed index pages is 
minimum as proven in [BBKK 97], and the termination is controlled by the refinement 
step in order to guarantee the minimum number of candidates [SK 98]. Only for the 
exact evaluation in the refinement step, the exact object representation is retrieved from 
the object server. 

In order to guarantee no false dismissals caused by the filter step, the filter distance
function $d_f$ has to be a lower bound of the exact object distance function $d_o$ that is evaluated
in the refinement step. That is, for all database objects p and all query objects q, the
following inequality has to hold:

$$d_f(p, q) \le d_o(p, q)$$
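
The interplay of filter ranking and refinement under this lower-bounding condition can be sketched as follows. This is a simplified illustration and not the implementation of [SK 98]: the incremental index-based ranking is emulated by sorting all objects on the filter distance, and all names are hypothetical. The loop stops as soon as the next filter distance exceeds the current k-th exact distance.

```python
import heapq


def multistep_knn(query, objects, k, d_filter, d_exact):
    """k-nearest neighbor search with a lower-bounding filter distance.

    Returns (answers, evaluated): the k best (distance, object) pairs in
    ascending order and the number of exact distance evaluations.
    """
    # Stand-in for an incremental ranking delivered by an index.
    ranking = sorted(objects, key=lambda o: d_filter(query, o))
    best = []  # max-heap on the exact distance via negated keys
    evaluated = 0
    for idx, obj in enumerate(ranking):
        # No later object can beat the current k-th exact distance.
        if len(best) == k and d_filter(query, obj) > -best[0][0]:
            break
        evaluated += 1
        dist = d_exact(query, obj)
        if len(best) < k:
            heapq.heappush(best, (-dist, idx, obj))
        elif dist < -best[0][0]:
            heapq.heapreplace(best, (-dist, idx, obj))
    answers = sorted(((-nd, obj) for nd, _, obj in best), key=lambda t: t[0])
    return answers, evaluated
```

In a small test with five 2D points, where the filter distance is the coordinate difference in the first dimension only, three of the five objects need an exact evaluation for k = 2.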

3.2 Reduction of Dimensionality for Quadratic Forms 

A common approach to manage objects in high-dimensional spaces is to apply tech- 
niques to reduce the dimensionality. The objects in the reduced space are then typically 
managed by any multidimensional index structure [GG 98]. The typical use of common 
linear reduction techniques such as the Principal Components Analysis (PCA) or Kar- 
hunen-Loeve Transform (KLT), the Discrete Fourier or Cosine Transform (DFT, DCT), 
the Similarity Matrix Decomposition [HSE+ 95] or the Feature Subselection [FBF+ 94]
includes a clipping of the high-dimensional vectors such that the Euclidean distance in 
the reduced space is always a lower bound of the Euclidean distance in the high-dimen- 
sional space. 
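
The lower-bounding property of clipping can be written out in one line. As assumptions for this worked step (the notation is ours), let T be an orthonormal transform such as the KLT, let $p' = Tp$ and $q' = Tq$, and let only the first r of the N transformed components be kept. Orthonormal transforms preserve the Euclidean distance, and dropping non-negative summands can only decrease the sum:

```latex
d_{Euclid}^2(p, q) = \sum_{i=1}^{N} (p'_i - q'_i)^2
                   \ge \sum_{i=1}^{r} (p'_i - q'_i)^2 , \qquad r \le N
```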

The question arises whether these established techniques are applicable to general
quadratic form distance functions. Fortunately, the answer is positive; an algorithm to
reduce the similarity matrix from a high-dimensional space down to a low-dimensional
space according to a given reduction technique was developed in the context of multimedia
databases for color histograms [SK 97] and shapes in 2D images [AKS 98]. The
method guarantees three important properties: First, the reduced distance function is a
lower bound of the given high-dimensional distance function. Obviously, this criterion
had to be a design goal in order to meet the requirements of multistep similarity query
processing. Second, the reduced distance function again is a quadratic form and, therefore,
the complexity of the query model is not increased while decreasing the dimension
of the space. Third, the reduced distance function is the greatest of all lower-bounding
distance functions in the reduced space. As an important implication of this property, the
selectivity in the filter step is optimal: In the reduced space, no lower-bounding distance
function is able to produce a smaller set of candidates than the resulting quadratic form.

Figure 6. Multistep similarity query processing (diagram labels: query, filter step,
objects, refinement step, result)

3.3 Ellipsoid Queries on Multidimensional Index Structures 

The task remains to efficiently support k-nearest neighbor search and incremental ranking
for quadratic form distance functions in low-dimensional spaces. Due to the geometric
shape of the query range, a quadratic form-based similarity query is called an ellipsoid
query [Sei 97]. An efficient algorithm for ellipsoid query processing on
multidimensional index structures was developed in the context of approximation-based 
similarity search for 3-D surface segments [KSS 97] [KS 98]. The method is designed 
for index structures that use a hierarchical directory based on rectilinear bounding boxes 
such as the R-tree [Gut 84], R+-tree [SRF 87], R*-tree [BKSS 90], X-tree [BKK 96]
[BBB+ 97], and Quadtrees among others; surveys are provided in [Sam 90] [GG 98].
The technique is based on measuring the minimum quadratic form distance of a query 
point to the hyperrectangles in the directory. Recently, an improvement by using conser- 
vative approximations has been suggested [ABKS 98]. 

An important property of the method is its flexibility with respect to the similarity 
matrix. The matrix does not have to be available at index creation time and, therefore, 
may be considered as a query parameter. Thus, the users may specify and adapt the 
similarity weights in the matrix even at query time according to their individual prefer- 
ences or to the specific requirements of the application. In any case, the same precom- 
puted index may be used. This property is the major advantage compared to previous 
solutions that were developed in the context of color histogram indexing in the QBIC 
project [FBF+ 94] [HSE+ 95] where the index depends on a specific similarity matrix
that has to be given in advance. 

The cost model of [BBKK 97] provides a theoretical analysis of the performance 
deterioration for multidimensional index structures with increasing dimensionality. An 
investigation in [WSB 98] results in the recommendation to use an accelerated sequential
scan, and the VA-File was developed following this paradigm. However, the analyses
are based on the $L_2$ (Euclidean distance), $L_1$, and $L_\infty$ norms that may be evaluated in
linear time depending on the dimension, and the results require careful reviewing and 
experimental evaluation when applied to quadratic form distance functions. Even if the 
index is substituted by a sequential scan, the filter-refinement architecture will still be 
necessary due to the high cost of exact quadratic form evaluations. 

4 Experimental Evaluation 

We implemented the algorithms in C++ and ran the experiments on our HP C160 workstations
under HP-UX 10.20. For single queries, we also implemented an HTML/Java
interface that supports query specification and visualization of the results. The atomic
coordinates of the 3D protein structures are taken from the Brookhaven Protein Data
Bank (PDB) [BKW+ 77]. For the computation of shape histograms, we use a representation
of the molecules by surface points as it is required for several interesting problems
such as the molecular docking prediction [SK 95]. The reduced feature vectors for the
filter step are managed by an X-tree [BKK 96] of dimension 10.

The similarity matrices are computed by an adapted formula from [HSE+ 95] where
the similarity weight $a_{ij}$ of the bins i and j is determined by the bin distance $d(i,j)$
and the parameter $\sigma$. The distance $d(i,j)$
is equal to the difference of the corresponding shell radii in the shell model and is given
by the angle between the sector axes in the sector model. In the combined model, the
shell distance $d_{shell}(i,j)$ and the sector distance $d_{sector}(i,j)$ of the bins i and j are composed
by using the Euclidean distance formula $d_{comb}(i,j) = \sqrt{d_{shell}(i,j)^2 + d_{sector}(i,j)^2}$.

We experimented with several values of the parameter $\sigma$ but did not observe significant
changes in the accuracy, so we set $\sigma$ equal to 10 for the following evaluations
[Kas 98].
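
The composition of shell and sector distances can be sketched as follows. The representation of a combined bin as a pair (shell index, sector index) and the use of unit vectors for the sector axes are assumptions made for this example.

```python
import math


def bin_distance(bin_a, bin_b, radii, axes):
    """Distance of two combined bins, each given as (shell_index, sector_index).

    radii: one radius per shell; axes: one unit vector per sector.
    """
    (shell_a, sector_a), (shell_b, sector_b) = bin_a, bin_b
    d_shell = abs(radii[shell_a] - radii[shell_b])
    dot = sum(u * v for u, v in zip(axes[sector_a], axes[sector_b]))
    # Clamp against rounding errors before taking the arc cosine.
    d_sector = math.acos(max(-1.0, min(1.0, dot)))
    return math.hypot(d_shell, d_sector)
```

Bins in the same sector differ only by their shell radii, and bins in the same shell differ only by the angle between their sector axes.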

4.1 Basic Similarity Search 

In order to illustrate the applicability of the similarity model, we demonstrate the retrieval
of the members of a known family. As a typical example, we chose the seven Seryl-tRNA
Synthetase molecules from our database that are assigned by CATH [OMJ+ 97]
to the same family. The diagram in Figure 7 presents the result using shape histograms
for 6 shells and 20 sectors. The seven members of the Seryl family rank on the top seven
positions among the 5,000 molecules of the database. In particular, the similarity distance
noticeably increases for 2PEK-A, the first non-Seryl protein in the ranking order.

4.2 Classification by Shape Similarity 

For the classification experiments, we restricted our database to the proteins that are also
contained in the FSSP database [HS 94] and took care that for every class, at least two
molecules are available. From this preprocessing, we obtained 3,422 proteins assigned
to 281 classes. The classes contain between 2 and 185 molecules. In order to measure the
classification accuracy, we performed leave-one-out experiments for various histogram
models. For each molecule in the database, the nearest neighbor classification was
determined after removing that element from the database. Technically, we always used
the same database and selected the second nearest neighbor since the query object itself
is reported to be its own nearest neighbor.
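
The leave-one-out procedure with the second-nearest-neighbor trick can be sketched as follows. This is a minimal illustration, assuming plain Euclidean distance on small feature vectors instead of the quadratic form distance and index structure used in the experiments; the function names are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def leave_one_out_accuracy(vectors, labels, dist=euclidean):
    """Fraction of objects whose second nearest neighbor carries the same label."""
    correct = 0
    for i, query in enumerate(vectors):
        # Rank the whole database by distance to the query; on ties the query sorts first.
        ranked = sorted(range(len(vectors)),
                        key=lambda k: (dist(query, vectors[k]), k != i))
        second = ranked[1]  # ranked[0] is the query object itself
        if labels[second] == labels[i]:
            correct += 1
    return correct / len(vectors)
```

Keeping the query in the database and skipping the first hit avoids rebuilding the database for every query, which is the point of the technical remark above.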

Figure 7. Similarity ranking for the Seryl-tRNA Synthetase 1SER-B. The diagram depicts
the similarity distances of the 12 nearest neighbors to the query protein 1SER-B in ascending
order. The illustration of the top eight molecules demonstrates the close similarity within the
family, and the dissimilarity to the first non-Seryl protein in the ranking



Figure 8 demonstrates the results for histograms based on 12 shells, 20 sectors, and 
the combination of them. Obviously, the more fine-grained spiderweb model yields the 
best classification accuracy of 91.5 percent (top diagram), but even for the coarse sector 
histograms, a noticeable accuracy of 87.3 percent is achieved. These results are
competitive with the accuracy of available protein classification systems such as CATH
[OMJ+ 97], where likewise more than 90% of the class labels are predicted correctly.
Whereas CATH uses only four different class labels for the automatic classification, our
experiments are based on a variety of 281 class labels.

The average overall runtime for a single query reflects the larger dimension of the 
combined model. It ranges from 0.05s for 12 shells over 0.2s for 20 sectors up to 1.42s 
for the combination. This runtime performance in the range of tens to thousands of
milliseconds is an improvement over established biomolecular systems, for which
query response times in the range of minutes and hours are reported [HS 98].

Figure 9 illustrates the effect of simply increasing the dimension of the model without 
combining orthogonal space partitionings. Again we observed the expected result that 
more information yields better accuracy. When increasing the histogram dimension by 
a factor of 10, the accuracy increases from 71.6 to 88.1 percent for the shell model, and from
87.3 to 91.6 percent for the sector model. For the task of classification, the increased granularity
results in a better separation of the class members from the other objects. Obviously, the 
tradeoff for this gain is a larger space requirement and the increase of the runtime due to 
the high dimensionality. We plan to develop a cost model for obtaining the optimal 
number of bins in order to produce both accurate and fast results. 

Figure 8. Classification accuracy (top diagram) and average runtime of query processing
(bottom diagram) for histograms with 12 shells, 20 sectors, and their combination



In these experiments, we achieve the same accuracy for a fine-grained 122D sector 
model as we obtained from the 12 x 20 (240D) combined model. One may wonder why 
the combined model does not lead to the best accuracy. Although all proposed models 
yield good results in terms of accuracy and runtime, the sector model turns out to be 
most suitable for the tested data. One reason for this data-dependent result is that the 
decomposition of the 3D objects is computed for uniform sectors or equidistant shells. 
To reveal the properties of the space decompositions, we computed the standard devia- 
tion for each bin over all histograms. We present the observations by bar diagrams where 
the height of a bar represents the value of the standard deviation for the corresponding 
dimension. 
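
The per-bin analysis can be sketched as follows. This is a minimal illustration, assuming the population standard deviation is meant; the threshold in `informative_bins` is a hypothetical parameter introduced here and does not appear in the paper.

```python
from statistics import pstdev

def bin_standard_deviations(histograms):
    """One population standard deviation per histogram bin (column)."""
    return [pstdev(column) for column in zip(*histograms)]

def informative_bins(histograms, threshold):
    """Indices of bins whose deviation exceeds a (hypothetical) threshold."""
    deviations = bin_standard_deviations(histograms)
    return [i for i, s in enumerate(deviations) if s > threshold]
```

Bins with a deviation close to zero carry almost the same value for every object and therefore add dimensions without helping to separate the classes.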

Figure 9. The accuracy increases with increasing granularity of the space partitioning for both
shell and sector histograms

Figure 10. Standard deviations of the bins for the 12D (top) and 120D (bottom) shell models

Figure 10 demonstrates that for the shell model, the values of the standard deviations
are distributed very unevenly. For the 12D model, the highest standard deviation
occurs in shells 2 to 4, which contain the large majority of the surface points, and a low
deviation is observed for shells 7 to 12. For the 120D model, significant deviations
occur only for shells 3 to 60; there is only a low variance in the number of points for
the other shells, which are very sparsely populated. Therefore, the corresponding
histogram bins do not help to distinguish between different molecules but just increase
the dimension and, as a consequence, the runtime.

Figure 11 depicts the standard deviations for the two sector models, 20D and 122D.
For every histogram bin, the standard deviation is high, and, therefore, all dimensions 
contribute to the distinction of different molecules. For the combined model, the stan- 
dard deviations of the 240 bins are illustrated in Figure 12. These 240 bins result from 
the decomposition of the 3D space into 12 shells and 20 sectors. Therefore, the periodic 
pattern in the standard deviation reflects the previous observations for the shell and the 
sector models. 

Figure 11. Standard deviations of the bins for the 20D (top) and 122D (bottom) sector models

Figure 12. Standard deviations of the combined histogram bins. For each of the 20 sectors, the
characteristic shape of the 12-shell histogram can be recognized

As a way to improve the properties of the shell model, we plan to use a more appro- 
priate partitioning of the space. Instead of using equidistant radii to define the decompo- 
sition, the shell radii could be based on quantiles that are obtained from the distribution 
of the surface points in the space. As a consequence, this approach will also improve the 
effectiveness of the combined model. 
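
A quantile-based choice of shell radii could look like the following sketch. It is an assumption-laden illustration of the idea, not the planned method itself: the input is taken to be the list of distances of all surface points from the center, and the function names are hypothetical.

```python
import math
from bisect import bisect_left

def quantile_shell_radii(distances, num_shells):
    """Outer radius per shell so that the shells hold roughly equal point counts."""
    ordered = sorted(distances)
    n = len(ordered)
    radii = []
    for s in range(1, num_shells + 1):
        last = max(0, math.ceil(s * n / num_shells) - 1)  # last point of shell s
        radii.append(ordered[last])
    return radii

def shell_histogram(distances, radii):
    """Count the points per shell; a point belongs to the first shell that covers it."""
    counts = [0] * len(radii)
    for d in distances:
        s = min(bisect_left(radii, d), len(radii) - 1)
        counts[s] += 1
    return counts
```

When many points share the same distance, the resulting shells can still be unbalanced, so this sketch only approximates an equal-frequency partitioning.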

5 Conclusions 

In this paper, we presented a new intuitive and flexible model for shape similarity search 
of 3D solids. As a specific feature transform, 3D shapes are represented by using shape 
histograms for which several partitionings of the space are possible. This histogram
model is naturally extensible to thematic attributes such as physical and chemical
properties. In order to account for errors of measurement, sampling, numerical rounding,
etc., quadratic form distance functions are used that are able to take small displacements
and rotations into account. For efficient query processing, a filter-refinement
architecture is used that supports similarity query processing based on high-dimensional
feature vectors and quadratic form distance functions. The experiments demonstrate both
the high classification accuracy of our similarity model and the good performance of the
underlying query processor.

As already mentioned, our plans for future work include the improvement of the space
decomposition by using a quantile-based method, the development of a cost model for
determining the optimal number of bins, and the investigation of thematically extended
histogram models. In addition, we will include a visualization of shape histograms
as a Java applet in order to provide an explanation component for the classification
system. This is an important issue since any notion of similarity is subjective to a high
degree, and the users want to have as much feedback as possible concerning the behavior
of the system depending on their queries and input parameters. Furthermore, the
confidence of the users in an automatic classification increases with the reproducibility
of the decision by the user, which can be enhanced by visualization methods. A more
conceptual direction for future work is the optimization of the space partitioning and the
geometry of the cells which form the histogram bins. Both the number and the geometry of
the cells affect the effectiveness and also the efficiency of similarity search and
classification.

Abstract. We propose a new multi-way spatial join algorithm called
M-way R-tree join which synchronously traverses M R-trees. The M- 
way R-tree join can be considered as a generalization of the 2-way R-tree 
join. Although a generalization of the 2-way R-tree join has recently been 
studied, it did not properly take into account the optimization techniques 
of the original algorithm. Here, we extend these optimization techniques 
for M-way joins. Since the join ordering was considered to be important in 
the M-way join literature (e.g., relational join), we especially consider the 
ordering of the search space restriction and the plane sweep. Additionally, 
we introduce indirect predicates in the M-way join and propose a further 
optimization technique to improve the performance of the M-way R-tree 
join. Through experiments using real data, we show that our optimization 
techniques significantly improve the performance of the M-way spatial
join. 



1 Introduction 

The spatial join is a common spatial query type which requires a high processing 
cost due to the high complexity and large volume of spatial data. Therefore, the 
spatial join is processed in two steps (the filter step and the refinement step) 
to reduce the overall processing cost 1 1 4lbj . Many 2-way spatial join methods 
have been published in the literature: the join using Z-order elements jl 4) . the 
join using R-trees (called R-tree join) 0, the seeded tree join (STJ) [lOj, the 
spatial hash join (SHJ) [HJ, the partition based spatial merge join (PBSM) [2Dj, 
the size separation spatial join (S³J) jOj, the scalable sweeping-based spatial join
(SSSJ) P and the slot index spatial join (SISJ) [T^ . However, there has been
little research on the multi-way spatial join. The M-way (M>2) spatial join
combines M spatial relations using M-1 or more spatial predicates¹. An example
of a 3-way spatial join is “Find all buildings which are adjacent to roads that 

* The work reported here was performed while Guang-Ho Cha was at Tongmyong 
University of Information Technology, Korea 
¹ If the number of spatial predicates is less than M-1, the join necessarily includes
cartesian products, in which case we regard the join not as one spatial join but as 
several spatial joins. 



R.H. Güting, D. Papadias, F. Lochovsky (Eds.): SSD’99, LNCS 1651, pp. 229-1223 1999-
© Springer-Verlag Berlin Heidelberg 1999

intersect with boundaries of districts.” An M-way spatial join can be modeled
by a query graph whose nodes represent relations and edges represent spatial 
predicates. 

One way to process M-way spatial joins is as a sequence of 2-way joins H2|. 
Another possible way, when all join attributes have spatial indexes and each join 
attribute is shared among the associated join predicates², is to combine the filter
and refinement steps respectively as follows: 

(1) Scan the relevant indexes synchronously for all join attributes to obtain a 
set of spatial object identifier tuples. 

(2) Read objects for object identifier (oid) tuples obtained from Step (1), and 
perform an M-way spatial join using geometric computation algorithms. 

Step (1) is called combined filtering and Step (2) combined refinement in [17].
In particular, when R-trees are used in Step (1), the combined filtering is called
the M-way R-tree join, which is the scope of this paper. The M-way R-tree join is
also called synchronous traversal (ST) in [Ej. An advantage of the combined
filtering is that it removes unnecessary refinement operations for some object
pairs. For example, let Figure 1 be an MBR (Minimum Bounding Rectangle)
combination of spatial objects for the above query. Let a, b and c be instances of
the relations buildings, roads and boundaries, respectively. If the query is processed
by a sequence of 2-way joins and the evaluation order is determined to be (a, b, c) by
a query optimizer, the refinement operation between a and b will be performed
unnecessarily. However, the combined filtering can avoid this situation.



Fig. 1. An MBR combination in a 3-way join



The M-way R-tree join can be considered as a generalization of the 2-way
R-tree join of m and does not create intermediate results. Although a
generalization of the 2-way R-tree join called multi-level forward checking (MFC) has
recently been studied, it did not properly take into account the optimization
techniques of the original 2-way R-tree join.

The main contributions of this paper are as follows: First, we generalize the
2-way R-tree join to consider the order of search space restrictions and plane sweeps
because the join ordering was considered to be important in the M-way join

² In this case, only one spatial predicate per relation participates in the join.



Multi-way Spatial Joins Using R- Trees 231 



literature (e.g., relational join) |^. Second, we introduce indirect predicates in 
the M-way spatial join and propose a further optimization technique to improve 
the performance of the M-way R-tree join. Through experiments, we show that 
our optimization techniques significantly improve the performance of the M-way 
spatial join (especially the filter step) against MFC. Additionally, we find that 
the M-way R-tree join becomes CPU-bound as M increases. 

The remainder of this paper is organized as follows: Section 2 provides some
background by briefly explaining the 2-way R-tree join and the state-of-the-art
M-way spatial joins using R-trees. In Section 3, we propose an algorithm for
the M-way R-tree join, which considers the ordering of search space restrictions
and plane sweeps, as a new generalization of the 2-way R-tree join, and further
improve the performance of the M-way R-tree join using the concept of
indirect predicates. In Section 4, we present some experiments for the performance
analysis of our algorithms using the TIGER data m-. Finally, in Section 5, we
conclude this paper and suggest some future studies.

2 Background 

2.1 2-Way Spatial Joins Using R-Trees

Assuming that R-trees iI2| exist for both join inputs, a join algorithm which 
synchronously traverses both R-trees using depth-first search was proposed 0. 
The basic idea of the algorithm is as follows: First, it reads the root nodes of 
the R-trees and checks if the rectangles of entries of both nodes mutually inter- 
sect. Next, only for intersected entry pairs, it traverses the child node pairs by 
depth-first search and continuously checks the intersection between the rectan- 
gles of entries of both child nodes. In this way, if the algorithm reaches the leaf 
nodes, it outputs the intersected entry pairs and backtracks to the parent nodes. 
Two optimization techniques, called search space restriction and plane sweep, 
are used to reduce the CPU time. The search space restriction heuristic eliminates
the entries whose rectangles do not intersect the rectangle enclosing the
other node, before the intersection is actually checked between the rectangles of
entries of both nodes. The plane sweep first sorts the rectangles of entries of
both nodes along one axis, then moves forward along the sweep line and checks
the intersection on the other axis. The algorithm using the above techniques
is shown below. (We skip the detailed algorithm for SortedIntersectionTest in
Step (6) due to space limitations. Refer to 0 for details.)

RtreeJoin (Rtree_Node R, S)
(1)  FOR all Ei ∈ R DO
(2)      IF Ei.rect ∩ S.rect == ∅ THEN R = R-{Ei};  /* space restriction on R */
(3)  FOR all Ej ∈ S DO
(4)      IF Ej.rect ∩ R.rect == ∅ THEN S = S-{Ej};  /* space restriction on S */
(5)  Sort(R); Sort(S);
(6)  SortedIntersectionTest (R, S, Seq);  /* plane sweep */
(7)  FOR i = 1 TO ||Seq|| DO
(8)      (ER, ES) = Seq[i];
(9)      IF R is a leaf page THEN  /* S is also a leaf page */
(10)         output (ER, ES);
(11)     ELSE
(12)         ReadPage(ER.ref); ReadPage(ES.ref);
(13)         RtreeJoin (ER.ref, ES.ref);
END RtreeJoin
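
The algorithm above can be sketched in Python as follows. This is an illustration under several assumptions and not the original implementation: nodes are in-memory objects (no page I/O or pinning), both trees have equal height, no node is empty, the node rectangle is recomputed from the entries, and the plane sweep in `sorted_intersection_test` is one common formulation chosen here because the detailed algorithm is not given in the text. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Rect = Tuple[float, float, float, float]  # (xl, yl, xu, yu)

@dataclass
class Entry:
    rect: Rect
    ref: Any  # child Node on inner levels, object id on the leaf level

@dataclass
class Node:
    entries: List[Entry]
    leaf: bool

def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def node_rect(node: Node) -> Rect:
    rs = [e.rect for e in node.entries]
    return (min(r[0] for r in rs), min(r[1] for r in rs),
            max(r[2] for r in rs), max(r[3] for r in rs))

def _scan(pivot, others, start, emit):
    # Visit the entries whose lower x does not exceed the pivot's upper x,
    # then check the overlap on the y axis.
    k = start
    while k < len(others) and others[k].rect[0] <= pivot.rect[2]:
        o = others[k]
        if pivot.rect[1] <= o.rect[3] and o.rect[1] <= pivot.rect[3]:
            emit(o)
        k += 1

def sorted_intersection_test(rs, ss):
    """Plane sweep over two entry lists that are sorted by lower x."""
    pairs = []
    i = j = 0
    while i < len(rs) and j < len(ss):
        if rs[i].rect[0] <= ss[j].rect[0]:
            er = rs[i]
            _scan(er, ss, j, lambda es: pairs.append((er, es)))
            i += 1
        else:
            es = ss[j]
            _scan(es, rs, i, lambda er: pairs.append((er, es)))
            j += 1
    return pairs

def rtree_join(r: Node, s: Node, out: list) -> None:
    r_rect, s_rect = node_rect(r), node_rect(s)
    rs = [e for e in r.entries if intersects(e.rect, s_rect)]  # restriction on R
    ss = [e for e in s.entries if intersects(e.rect, r_rect)]  # restriction on S
    rs.sort(key=lambda e: e.rect[0])
    ss.sort(key=lambda e: e.rect[0])
    for er, es in sorted_intersection_test(rs, ss):
        if r.leaf:  # s is a leaf as well
            out.append((er.ref, es.ref))
        else:
            rtree_join(er.ref, es.ref, out)
```

Calling `rtree_join(root_r, root_s, out)` with two root nodes collects the object id pairs whose MBRs intersect in `out`, which corresponds to the filter step only.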

Additionally, the algorithm applied the page pinning technique for I/O
optimization. The algorithm used only a local optimization policy to fetch the child
node pairs. Later, a global optimization algorithm by breadth-first search was
proposed 0. In this paper, we call both of the join algorithms the 2-way R-tree join
or simply the R-tree join. When R-trees exist for both join inputs, it has been shown
that the R-tree join is most efficient.

2.2 State-of-the-Art M-Way Spatial Joins Using R-Trees

In a recent study, two methods called multi-level forward checking (MFC) and
window reduction (WR) were proposed to process structural queries for image
similarity retrieval. Later, they were applied to the multi-way spatial join
m-. MFC and WR were motivated by a close correspondence between multi-way
spatial joins and constraint satisfaction problems (CSPs). A multi-way spatial
join can be represented in terms of a binary CSP m-:

— A set of n variables, $v_1, v_2, \ldots, v_n$, each corresponding to a dataset.

— For each variable $v_i$, a domain $D_i$ which consists of the data in tree $R_i$.

— For each pair of variables $(v_i, v_j)$, a binary constraint $Q_{ij}$ corresponding to
a spatial predicate.

If $Q_{ij}(E_{i,x}, E_{j,y})$ = TRUE, then the assignment $\{v_i = E_{i,x}, v_j = E_{j,y}\}$ is
consistent. A solution is an n-tuple $\tau = (E_{1,w}, \ldots, E_{i,x}, \ldots, E_{j,y}, \ldots, E_{n,z})$
such that $\forall i,j$, $\{v_i = E_{i,x}, v_j = E_{j,y}\}$ is consistent. In the sequel, we use the
terms variable/dataset/relation and constraint/predicate/join condition interchangeably.
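
The notion of a consistent solution can be illustrated by a small check. The sketch assumes intersect as the only spatial predicate and represents the query graph as a boolean matrix `query`; both the representation and the function names are assumptions made for this example.

```python
def intersects(a, b):
    """Rectangles are (xl, yl, xu, yu); touching rectangles count as intersecting."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def is_solution(assignment, query):
    """True if every pair of variables joined in the query graph is consistent."""
    n = len(assignment)
    return all(intersects(assignment[i], assignment[j])
               for i in range(n) for j in range(i + 1, n) if query[i][j])
```

For a chain query, only neighboring variables must intersect, so a tuple can be a solution of the chain query without being a solution of the corresponding clique query.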

1) Multi-level Forward Checking. MFC is an ST algorithm which
synchronously traverses n R-trees as follows: It starts from the root nodes of the
n R-trees and checks all predicates for each n-combination (called entry-tuple)
of the entries of the nodes. If an entry-tuple satisfies all the predicates, one of
the following occurs: If the node-tuple (an n-combination of the R-tree nodes) is
at an intermediate level, the algorithm is recursively called for the child
node-tuple pointed to by the entry-tuple. Otherwise, i.e., if the node-tuple consists of
leaf nodes, the algorithm outputs the entry-tuple and processes the next
entry-tuple. If an entry-tuple fails to satisfy at least one predicate, the entry-tuple is
pruned. MFC was considered as a generalization of the 2-way R-tree join.

At each R-tree level, MFC applies forward checking (FC), which is known to
be one of the most effective algorithms for solving CSPs, to find the entry-tuples
satisfying the predicates. FC maintains an n * n * C domain table (n: number
of variables, C: the maximum number of entries of an R-tree node) in main
memory. domain[i][j] ($0 \le i, j < n$) is a subset of an R-tree node $N_j$. FC works
as follows: First, domain[0][j] is initialized to an R-tree node $N_j$ for all j.
When a variable $v_0$ is assigned a value $u_k$, domain[1][j] is computed for each
remaining $v_j$ by including only values $u_l \in$ domain[0][j] such that $Q_{0j}(u_k, u_l)$ =
TRUE. In general, if $u_k$ is the current value of $v_i$, domain[i+1][j] is the subset of
domain[i][j] which is valid w.r.t. $Q_{ij}$ and $u_k$. In this way, at each instantiation the
domain of each future variable (un-instantiated variable) continuously shrinks.
FC outputs a solution whenever the last variable is given a value. When the
domain of the current variable is exhausted, the algorithm backtracks to the
previous one.
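
The forward checking scheme for one node-tuple can be sketched as follows. This is a simplified illustration, assuming intersect as the only predicate, a fixed variable order (no DVO), and a recursive formulation in which each call level holds its own domain lists instead of an explicit n * n * C table; names are hypothetical.

```python
def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def forward_checking(nodes, query):
    """nodes[j] is the list of entry rectangles of node N_j; returns index tuples."""
    n = len(nodes)
    solutions = []

    def assign(i, domain, partial):
        for k in domain[i]:
            chosen = partial + [k]
            if i == n - 1:  # the last variable received a value
                solutions.append(tuple(chosen))
                continue
            nxt = list(domain)  # plays the role of domain[i+1]
            wiped_out = False
            for j in range(i + 1, n):
                if query[i][j]:
                    nxt[j] = [l for l in domain[j]
                              if intersects(nodes[i][k], nodes[j][l])]
                    if not nxt[j]:  # a future variable has no value left
                        wiped_out = True
                        break
            if not wiped_out:
                assign(i + 1, nxt, chosen)

    assign(0, [list(range(len(nd))) for nd in nodes], [])
    return solutions
```

Returning from the recursive call when a domain is exhausted corresponds to the backtracking step described above.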

For ordering of variable instantiations, MFC applies the dynamic variable 
ordering (DVO), which is also mainly used in CSPs. DVO dynamically reorders 
the future variables after each instantiation so that the variable with the min- 
imum domain size becomes the next variable. Additionally, MFC adopts the 
search space restriction technique to improve performance. A slightly modified 
version of the space restriction algorithm used in HS| is shown below: 

BOOLEAN SpaceRestriction_1 (Query_graph Q[][], Rtree_Nodes N[])
(1)  FOR i=0 TO n-1 DO
(2)      ReadPage (N[i]);
(3)      FOR all Ek ∈ N[i] DO
(4)          FOR j=0 TO n-1, i≠j DO
(5)              IF Q[i][j] == TRUE AND Ek.rect ∩ N[j].rect == ∅ THEN
(6)                  N[i] = N[i]-{Ek};
(7)                  BREAK;
(8)      IF N[i]==∅ THEN RETURN FALSE;
(9)  RETURN TRUE;
END SpaceRestriction_1
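
A Python rendering of this space restriction might look as follows. It is a sketch under assumptions: each node is an in-memory list of entry rectangles, the node rectangle is computed from the entries before any pruning, and page reads are omitted. The function names are hypothetical.

```python
def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def mbr(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def space_restriction(nodes, query):
    """Prune the entries of each node in place; False if some node becomes empty."""
    n = len(nodes)
    node_rects = [mbr(entries) for entries in nodes]
    for i in range(n):
        # Keep an entry only if it intersects the rectangle of every adjacent node.
        nodes[i] = [r for r in nodes[i]
                    if all(intersects(r, node_rects[j])
                           for j in range(n) if j != i and query[i][j])]
        if not nodes[i]:
            return False
    return True
```

The nodes are visited in index order, so an empty node late in the order is detected only after all earlier nodes have been processed.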

We do not adopt MFC for the following reasons: First, MFC does not apply
the plane sweep technique, which is fairly efficient in the rectangle intersection
problem Em, but uses FC-DVO, which is just a special form of the nested
loop. Second, during the space restriction, MFC does not consider the space
restriction order among the n R-tree nodes, i.e., which node should be checked first.
In Section 3.1, we propose a new generalization of the 2-way R-tree join which
considers both the space restriction ordering and the plane sweep technique.



2) Window Reduction. WR maintains an n * n domain window (instead of
a 3D domain table) that encloses all potential values for each variable. When a
variable is instantiated, the domain window of each future variable is shrunk to
the intersection between the window newly computed from the current
variable instantiation and the existing domain window. For the instantiation of the
current variable, a window query is performed using the current domain window.
In WR, the DVO technique was also applied to reorder the future variables, i.e.,
the future variable with the smallest domain window becomes the next variable
to be examined. WR was considered as a special form of the indexed nested loop
join. However, it does not generate intermediate results. WR must essentially
search the whole space in order to instantiate the first variable. To avoid the
blind instantiation of the first variable, a hybrid technique called join window
reduction (JWR) was proposed. JWR applies the R-tree join for the first
pair of variables and then WR for the rest of the variables.

In [21], a slightly different WR algorithm was proposed for the multi-way
intersection join. In that algorithm, the instantiation order of variables is
predetermined according to an optimization method. The query window is chosen as
follows: For acyclic queries (tree topology), it is the rectangle of the instantiated
variable that is directly connected to the current variable. For
complete queries (clique topology), it is the common intersected rectangle of all
instantiated variables. In our implementation and experiments, regardless of the
query type, we selected, among the instantiated variables which are connected to the
current variable, the one whose value has the smallest rectangle, and used that
rectangle as the query window for the next variable instantiation.



3 New Methods for M-Way Spatial Joins Using R-Trees

3.1 A New Generalization of the 2-Way R-Tree Join

In this section, we propose a new M-way join algorithm which extends both 
the search space restriction and the plane sweep optimization techniques of the 
2-way R-tree join. We emphasize the ordering of both optimization techniques, 
assuming only intersect (not disjoint) as a join predicate. 



1) Search Space Restriction. Algorithm SpaceRestriction_1 does not
consider the ordering among the M R-tree nodes. If no entry of an R-tree node passes
the space restriction, we do not have to check the other nodes. Especially in
an incomplete join (no join predicate between some variables), the possibility
that no entry of an R-tree node passes the space restriction is high. In
such a case, Algorithm SpaceRestriction_1 may result in unnecessary reading of
other nodes. Therefore, the space restriction order of the R-tree nodes becomes
important. For example, Figure 2 shows an MBR intersection between intermediate
nodes of the R-trees for a 4-way spatial join “X intersect Y and Y intersect
Z and Z intersect W.” Since no entry of node B simultaneously intersects
nodes A and C, the intermediate node-tuple (A, B, C, D) cannot pass the
space restriction and becomes a false hit. If the space restriction is performed
first on node B, we do not have to check the other nodes A, C and D and can save
I/O and CPU time.

Fig. 2. Intersection between intermediate nodes of R-trees



We will explain the space restriction ordering (SRO) in the context of the 
query graph. For SRO, we use the following two metrics per node $N_i$, $0 \le i \le n-1$:

(1) normalized common rectangle area (NCRA): the area of the common
intersection of $N_i$ and its adjacent nodes divided by the area of the
rectangle of $N_i$. Formally, for all $N_j$ where $i = j$ or $Q_{ij}$ = TRUE, $0 \le j \le n-1$,
$\mathrm{area}(\bigcap_j N_j.\mathrm{rect}) / \mathrm{area}(N_i.\mathrm{rect})$.

(2) maximum inter-rectangle distance (MIRD): the sum of squares of the
maximum of distances per axis between the nodes adjacent to $N_i$. Formally, for
all $N_j$ and $N_k$ where $Q_{ij}$ = TRUE and $Q_{ik}$ = TRUE, $0 \le j, k \le n-1$, $j \ne k$,
$(\max\{x\_dist(N_j, N_k)\})^2 + (\max\{y\_dist(N_j, N_k)\})^2$.

Using the above two metrics, we perform SRO on the basis of the following 
criteria: 

(1) Choose a node with the minimum NCRA. 

(2) If the minimum NCRA is zero for more than one node, choose a node such 
that MIRD is maximal. 

In Figure 2, the common intersected rectangles for the nodes A, B, C and D are A.rect ∩
B.rect, A.rect ∩ B.rect ∩ C.rect, B.rect ∩ C.rect ∩ D.rect and C.rect ∩ D.rect, respectively.
Since nodes B and C have zero NCRA, these two nodes are selected by Criterion (1).
Then, since the MIRD of node B (only between A and C in this case) is larger
than the MIRD of node C (between B and D), we perform the space restriction for
node B first by Criterion (2).
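
The two metrics and the resulting ordering can be sketched as follows. The sketch makes some assumptions: the distance per axis is taken to be the gap between two rectangles along that axis (zero if they overlap on it), node rectangles have non-zero area, and the MIRD tie-break is applied to all NCRA ties rather than only to zero values. The rectangle coordinates in the usage note are invented to resemble the situation of Figure 2, and all names are hypothetical.

```python
from itertools import combinations

def intersection(a, b):
    xl, yl = max(a[0], b[0]), max(a[1], b[1])
    xu, yu = min(a[2], b[2]), min(a[3], b[3])
    return (xl, yl, xu, yu) if xl <= xu and yl <= yu else None

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def neighbours(i, query):
    return [j for j in range(len(query)) if j != i and query[i][j]]

def ncra(i, rects, query):
    """Normalized area of the common intersection of node i and its adjacent nodes."""
    common = rects[i]
    for j in neighbours(i, query):
        common = intersection(common, rects[j])
        if common is None:
            return 0.0
    return area(common) / area(rects[i])

def axis_gap(a, b, axis):
    # axis 0 compares x extents, axis 1 compares y extents
    return max(0.0, max(a[axis], b[axis]) - min(a[axis + 2], b[axis + 2]))

def mird(i, rects, query):
    """Sum of squares of the maximal per-axis gaps between nodes adjacent to node i."""
    pairs = list(combinations(neighbours(i, query), 2))
    dx = max((axis_gap(rects[j], rects[k], 0) for j, k in pairs), default=0.0)
    dy = max((axis_gap(rects[j], rects[k], 1) for j, k in pairs), default=0.0)
    return dx ** 2 + dy ** 2

def space_restriction_order(rects, query):
    """Node indices sorted by ascending NCRA, ties broken by descending MIRD."""
    return sorted(range(len(rects)),
                  key=lambda i: (ncra(i, rects, query), -mird(i, rects, query)))
```

For four rectangles A = (0, 0, 2, 4), B = (1, 0, 9, 4), C = (8, 0, 12, 4), D = (11, 0, 13, 4) and the chain query A-B-C-D, nodes B and C have zero NCRA and B has the larger MIRD, so B is restricted first.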

In Metric (1), the reason we use the normalized area instead of the (absolute)
common intersected rectangle area (CRA) is to choose a node which has large 
dead space (If CRAs are the same, a larger node is more likely to have more dead 
space). The dead space of the MBR of an intermediate node may be influenced 
by many factors such as the number of entries, the distribution of the rectangles 



236 



Ho-Hyun Park, Guang-Ho Cha, and Chin- Wan Chung 



of the entries, the density of the rectangles of the entries, and the MBR size of 
the node. If the other conditions are fixed, the smaller the number of entries of 
an intermediate node the more dead space the MBR of an intermediate node 
may have. Skewed distributions, low density and large MBRs may lead to large 
dead space. However, we cannot know the above characteristics except the MBR 
size unless we visit the node. Therefore, we choose only the MBR size among the 
above characteristics. We expect that NCRA behaves better than CRA especially 
in the complete query graph because CRAs of all nodes are always the same. 

The time complexity of SRO is as follows: It takes $O(M)$ time to compute
NCRA and MIRD per node. Therefore, it takes $O(M^2)$ time for all nodes. For
sorting NCRA and MIRD, it takes $O(M \log_2 M)$ time. Therefore, the overall
time complexity is $O(M^2)$. Algorithm SpaceRestriction_2 is identical to
Algorithm SpaceRestriction_1 but considers the ordering of nodes according to the above
criteria.



2) Plane Sweep. In MFC, FC-DVO was used in a node-tuple join because it
was known to be efficient in CSPs. However, the plane sweep algorithm was also
known to be fairly efficient in the rectangle intersection problem m. Therefore,
we use the plane sweep as the second optimization technique rather than
FC-DVO. In the 2-way join, the plane sweep algorithm is applied only once. In the
M-way join, however, the plane sweep algorithm must be applied multiple times
because there are M variables and at least M-1 predicates. In this case, the
ordering of plane sweeps among R-tree nodes becomes important.

Our plane sweep ordering (PSO) works as follows: In PSO, we call the
evaluated nodes inner nodes and the un-evaluated nodes outer nodes. In the 
following, the cardinality of a node is the number of entries in the node, and the 
degree of a node is the number of edges (i.e., the number of predicates) incident 
on the node. Before PSO starts, all R-tree nodes are initialized to outer nodes. 

(1) Choose the first two connected nodes whose sum of the cardinalities / the 
maximal degree between the two nodes is minimal. 

(2) Apply plane sweep between the selected two nodes and make the two nodes 
inner nodes. 

(3) Choose an outer node which is adjacent to one or more inner nodes such 
that cardinality / degree is minimal. 

(4) Choose an inner node which is adjacent to the selected outer node and whose 
cardinality is minimal. 

(5) Apply plane sweep between the selected inner node and the selected outer 
node. 

(6) Check additional predicates, if any, between the selected outer node and 
other inner nodes. 

(7) Make the selected outer node an inner node. 

(8) Stop if all nodes are inner nodes, otherwise go to Step (3). 
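
The ordering part of PSO (without the actual plane sweeps and predicate checks) can be sketched as follows. The sketch assumes a connected query graph, uses the original node cardinalities throughout, and breaks ties by node index; these choices and the function name are assumptions of this example.

```python
def plane_sweep_order(card, query):
    """Return a list of (inner, outer) node pairs in the order they are swept."""
    n = len(card)
    degree = [sum(1 for j in range(n) if j != i and query[i][j]) for i in range(n)]
    # Step (1): connected pair with minimal (sum of cardinalities / maximal degree).
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if query[i][j]]
    first = min(pairs, key=lambda p: ((card[p[0]] + card[p[1]])
                                      / max(degree[p[0]], degree[p[1]]), p))
    steps = [first]
    inner = list(first)  # Step (2): both nodes become inner nodes
    outer = set(range(n)) - set(first)
    while outer:  # Step (8)
        # Step (3): outer node adjacent to an inner node with minimal cardinality/degree.
        candidates = [o for o in outer if any(query[o][x] for x in inner)]
        o = min(candidates, key=lambda v: (card[v] / degree[v], v))
        # Step (4): adjacent inner node with minimal cardinality.
        partner = min((x for x in inner if query[o][x]), key=lambda x: (card[x], x))
        steps.append((partner, o))  # Steps (5) and (6) would sweep and check here
        inner.append(o)  # Step (7)
        outer.remove(o)
    return steps
```

For a chain query over four nodes with cardinalities 50, 10, 20 and 5, the sketch starts with the pair (2, 3) and then adds nodes 1 and 0.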

In Step (1) and Step (3), we divide the cardinality by the degree
because the more predicates there are, the smaller the intermediate
result size may be. The time complexity of PSO excluding actual plane sweeps
is as follows: It takes $O(M^2)$ time to choose the first two nodes. For the
ordering of the remaining variables, it also takes $O(M^2)$ time. Therefore, the overall
time complexity of PSO is $O(M^2)$.

The direct application of PSO generates intermediate results M-2 times m-.
Since the number of solutions in a node-tuple join can be up to $C^M$ ($C$: the
maximum number of entries of an R-tree node) in the worst case, we need a
main memory buffer which can store $C^M$ tuples per R-tree level. For example, if
the node size is 2048 bytes and the entry size is 20 bytes, $C$ and $C^M$ are about
100 and $100^5$ respectively in a 5-way join. Although a much smaller buffer will
be sufficient in general, this is a tremendous amount of main memory for the
worst case. In order to solve this main memory problem, we can use pipelining.
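
The worst-case arithmetic can be reproduced with a few lines. The sketch assumes that the worst-case number of entry-tuples is the number of entries per node raised to the number of joined relations; the function name is hypothetical.

```python
def worst_case_buffer(node_bytes, entry_bytes, m):
    """Entries per node and the assumed worst-case tuple count of an m-way node-tuple join."""
    c = node_bytes // entry_bytes
    return c, c ** m
```

With 2048-byte nodes and 20-byte entries, a node holds 102 entries, so a 5-way join can produce on the order of 10^10 entry-tuples per level in the worst case.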

For both the plane sweep and pipelining, we use M buffers (Seq[] in Algorithm
MwayRtreeJoin_1), each of which holds intermediate entry-tuples. The buffer size
is determined according to the main memory size. We apply plane sweep between
the first two nodes selected by PSO. The intermediate entry-tuples produced by
the first plane sweep are accumulated in Seq[1]. If Seq[1] is full or all entries in
the first two nodes are evaluated, we recursively call the plane sweep algorithm
taking Seq[1] and the next selected outer node as parameters. In general, the
plane sweep between Seq[m] and an outer node accumulates the intermediate
result in Seq[m+1]. If plane sweep is called for the last outer node, the algorithm
backtracks to the previous one. In PSO with pipelining, the actual ordering is
determined once per node-tuple join. The M-way R-tree join algorithm using
PSO with pipelining is shown below:

MwayRtreeJoin_1 (Query_graph Q[][], Rtree_Nodes N[])
(1)  IF NOT SpaceRestriction_2(Q[][], N[]) THEN RETURN;
(2)  PSO (Q[][], N[], outer_order[], inner_order[]);
(3)  i = outer_order[0]; j = inner_order[0];
(4)  Seq[0] = N[j];
(5)  FOR k=0 TO n-1, k≠j DO Sort (N[k]);  /* sort all outer nodes */
(6)  PipelinedPlaneSweep (Q[][], N[], Seq[], i, j, 0);
END MwayRtreeJoin_1

PipelinedPlaneSweep (Query_graph Q[][], Rtree_Nodes N[], Entry_Tuple_Buf Seq[],
                     int i, int j, int m)
(1)  Sort (Seq[m]);
(2)  SortedIntersectionTest_1 (N[i], Seq[m], j, Seq[m+1]);  /* plane sweep +
         additional predicate checking until Seq[m+1] is full or all entries
         in N[i] and Seq[m] are evaluated */
(3)  IF m == n-2 THEN  /* the last outer node is evaluated */
(4)      FOR all Tk ∈ Seq[m+1] DO
(5)          IF all N[l] are leaf nodes, 0≤l≤n-1 THEN
(6)              output Tk;
(7)          ELSE  /* all tree heights are equal */
(8)              MwayRtreeJoin_1 (Q[][], Tk.ref[]);  /* go downward */
(9)  ELSE
(10)     i = outer_order[m+1]; j = inner_order[m+1];
(11)     PipelinedPlaneSweep (Q[][], N[], Seq[], i, j, m+1);
(12) IF all entries in N[i] and Seq[m] are evaluated THEN
(13)     RETURN;
(14) ELSE
(15)     empty Seq[m+1];
(16)     goto Step (2);
END PipelinedPlaneSweep

In Algorithm MwayRtreeJoin_1, SortedIntersectionTest_1 is the same as 
SortedIntersectionTest in Algorithm RtreeJoin except for the following: First, one 
input is a sequence of entry-tuples. Second, additional predicate checks are done 
between the selected outer node and the non-selected inner nodes. Third, when 
Seq[m+1] is full, SortedIntersectionTest_1 exits and the status of both loop 
counters¹ in the algorithm is saved for the next call. In our implementation of 
PSO, we did not use pipelining because all intermediate results fitted in main 
memory. 
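
The resumable behaviour described above (exit when the output buffer is full, resume later from the saved loop state) can be sketched with a Python generator. This is a minimal illustration and not the authors' implementation: the rectangle layout `(xmin, ymin, xmax, ymax)`, the function names, and the use of a generator to hold the loop counters are all assumptions made for the example, and it joins two plain rectangle lists rather than entry-tuples.

```python
from itertools import islice
from typing import Iterator, List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


def y_overlap(a: Rect, b: Rect) -> bool:
    """True if the two rectangles overlap on the y-axis."""
    return a[1] <= b[3] and b[1] <= a[3]


def plane_sweep_pairs(r: List[Rect], s: List[Rect]) -> Iterator[Tuple[Rect, Rect]]:
    """Yield every intersecting pair (a, b), a from r and b from s.

    Both inputs are sorted by xmin. The generator keeps the two loop
    positions (i, j) alive between calls, so a consumer can stop at any
    time and continue later.
    """
    r, s = sorted(r), sorted(s)
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] <= s[j][0]:
            k = j  # scan s while it still starts inside r[i]'s x-extent
            while k < len(s) and s[k][0] <= r[i][2]:
                if y_overlap(r[i], s[k]):
                    yield r[i], s[k]
                k += 1
            i += 1
        else:
            k = i  # scan r while it still starts inside s[j]'s x-extent
            while k < len(r) and r[k][0] <= s[j][2]:
                if y_overlap(r[k], s[j]):
                    yield r[k], s[j]
                k += 1
            j += 1


def batches(pairs, buffer_size: int):
    """Drain the sweep in chunks of at most buffer_size results."""
    pairs = iter(pairs)
    while True:
        batch = list(islice(pairs, buffer_size))
        if not batch:
            return
        yield batch
```

Each batch plays the role of one filled Seq[m+1]: the consumer processes it, empties it, and asks for the next one, which corresponds to the "goto Step (2)" branch of PipelinedPlaneSweep.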

3.2 Consideration of Indirect Predicates 

The maximum number of possible predicates in the M-way spatial join is 
M*(M-1)/2, i.e., all relation pairs have join conditions. We call such a join complete. 
If a join is not complete, i.e., the number of predicates is less than M*(M-1)/2, 
the join is incomplete. 
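
As a small illustration of this definition, the sketch below classifies a query by counting its distinct predicate pairs. The function name and the representation of predicates as pairs of relation indices are assumptions made for the example.

```python
from typing import Iterable, Tuple


def classify_join(m: int, predicates: Iterable[Tuple[int, int]]) -> str:
    """Return "complete" if every pair of the m relations has a predicate."""
    # frozenset makes (0, 1) and (1, 0) count as the same predicate
    pairs = {frozenset(p) for p in predicates}
    max_predicates = m * (m - 1) // 2
    return "complete" if len(pairs) == max_predicates else "incomplete"
```

For M = 4, a chain query has 3 predicates and a ring query has 4, both below the maximum of 6, so both are incomplete.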

As was pointed out in earlier work, the M-way R-tree join may generate many false 
intersections in intermediate levels. As we can see in Figure 2, especially in an 
incomplete join, the possibility of a false intersection is high. In this case, if we 
can detect the false intersections before visiting the intermediate node-tuple, we 
can further reduce I/O and CPU time. For example, if we know in advance that 
no entry of node B can simultaneously intersect nodes A and C in Figure 2, we 
can avoid reading node B and checking the intersection between all entries of 
node B and other nodes (A and C) during space restriction. In this section, we 
propose a technique which detects a false intersection in intermediate levels of 
R-trees before visiting the node-tuple. 

¹ Two internal loops exist in SortedIntersectionTest(). 

1) Indirect Predicates. In a query “X intersect Y and Y intersect Z and 
Z intersect W” like the one in Figure 2, it seems that there is no relationship 
between X and Z (or between Y and W, or between X and W). However, for a 
data tuple (i.e., a tuple of entries from leaf nodes) $(a, b, c, d)$ which satisfies the 
query, $x\_dist(a,c) \le b_x$ (or $x\_dist(b,d) \le c_x$, or $x\_dist(a,d) \le b_x + c_x$) must be 
satisfied on the x-axis ($b_x$ represents the x-length of a data MBR $b$). The same 
condition holds on the y-axis. Consequently, for the data tuple $(a,b,c,d)$, 
$x\_dist(a,c) \le \max\{b_{jx} \mid b_j \in dom(Y)\}$ (or $x\_dist(b,d) \le \max\{c_{kx} \mid c_k \in dom(Z)\}$, or 
$x\_dist(a,d) \le \max\{b_{jx}\} + \max\{c_{kx}\}$) must be satisfied on the x-axis ($dom(Y)$ 
represents the domain (i.e., relation) of data MBRs for variable Y). The same 
condition holds on the y-axis. We call the user predicates in the query such as “X intersect 
Y” and “Y intersect Z” the direct predicates, and the derived predicates such as 
$x\_dist(X,Z) \le \max\{b_{jx}\}$ and $x\_dist(Y,W) \le \max\{c_{kx}\}$ the indirect predicates. 
In R-trees, the x-length and y-length of MBRs of intermediate nodes may be 
longer than the max x-length and max y-length of the data MBRs in the 
domain. In Figure 2, if $x\_dist(A,C) > \max\{b_{jx}\}$ (or $x\_dist(B,D) > \max\{c_{kx}\}$, or 
$x\_dist(A,D) > \max\{b_{jx}\} + \max\{c_{kx}\}$), we do not have to visit the node-tuple 
$(A,B,C,D)$ because the descendant node-tuples will never satisfy the query. 
Therefore, if we take advantage of the indirect predicates in intermediate levels 
of the M-way R-tree join, we can achieve more pruning effects. We call such 
pruning indirect predicate filtering (IPF). The max x-length and y-length can be 
obtained from the statistical information in the database schema. 
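
The pruning test for a chain query such as X-Y-Z-W can be sketched on one axis as follows. This is an illustrative sketch under stated assumptions: the text does not spell out how x_dist is measured, so the code takes it to be the gap between two extents (0 when they overlap), and the names `axis_dist` and `prune_chain_node_tuple` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (low, high) extent of an MBR on one axis


def axis_dist(a: Interval, b: Interval) -> float:
    """Assumed distance on one axis: the gap between two extents, 0 if they overlap."""
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))


def prune_chain_node_tuple(extents: List[Interval], max_len: List[float]) -> bool:
    """Indirect predicate filtering for a chain query R0 - R1 - ... - Rn-1.

    extents[i] is the extent of the node taken from relation i, and
    max_len[i] is the maximum object length in relation i on this axis.
    The node-tuple can be skipped if two non-adjacent nodes are farther
    apart than the summed max lengths of the relations between them.
    """
    n = len(extents)
    for i in range(n):
        bound = 0.0
        for j in range(i + 2, n):
            bound += max_len[j - 1]  # add the newly enclosed intermediate relation
            if axis_dist(extents[i], extents[j]) > bound:
                return True
    return False
```

In a full filter the same test would be run on the y-axis as well, and a node-tuple would be dropped as soon as either axis reports a violation.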



2) Indirect Predicate Paths and Lengths. In Figure 2, we call the 
paths ABC, BCD and ABCD for indirect predicate pairs AC, BD and AD the 
indirect predicate paths (ipp), and the x-path lengths $\max\{b_{jx}\}$, $\max\{c_{kx}\}$, and 
$\max\{b_{jx}\} + \max\{c_{kx}\}$ the indirect predicate x_path lengths (x_ippl). The indirect 
predicate y_path lengths (y_ippl) are similarly defined. In Figure 2, since there 
is only one indirect predicate path for each indirect predicate pair, it is easy to 
compute indirect predicate paths and indirect predicate path lengths. However, 
there can be several indirect predicate paths for an indirect predicate pair in a 
general M-way join, and the x_path and y_path for the predicate pair can be 
different. Therefore, we need a systematic method to compute indirect predicate 
paths and their lengths. 

We first draw a query graph whose nodes represent relations and whose edges 
represent direct predicates. Then, we assign weights to nodes. The weight of a node 
is the maximum x-length (x_max) and y-length (y_max) in the relation which 
the node represents. Since there can be multiple paths between a node pair, we 
compute the ipp and ippl by using the shortest path algorithm. In order to get 
the shortest path between a node pair, we need edge weights but we have only 
node weights now. Therefore, we obtain the edge weights from node weights. The 
weight of an edge is obtained by summing the weights of the two nodes on which 
the edge is incident. An example query graph having both node weights and edge 
weights for a 5-way join is shown in Figure 3(a). We call this query graph the 
maximum weighted query graph. 

When there is no direct predicate between two nodes S and D in a maximum 
weighted query graph, the ipp and ippl between S and D can be obtained as 
follows: First, we calculate the shortest path and shortest path length per axis. 
Next, we subtract the weights of both S and D from the shortest path length 
and then divide the result by 2. This is because we want to get the 
sum of the weights of the intermediate nodes on the shortest path, but the edge 
weights that make up the shortest path length include the weights of S and D 
once and the weight of each intermediate node twice. Therefore, the x_ippl 
between S and D can be calculated by Expression (1). The y_ippl is similarly 
defined. 

$$x\_ippl(S, D) = \frac{x\_shortest\_path\_length(S, D) - x\_max(S) - x\_max(D)}{2} \qquad (1)$$

The ipp's and ippl's for all indirect predicate pairs in Figure 3(a) are shown in 
Figure 3(b). In Figure 3, the x_ipp and y_ipp are different for indirect predicate 
pairs AD and AE. 
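
The computation of Expression (1) can be sketched for one axis as follows. The all-pairs shortest paths are obtained here with the Floyd-Warshall algorithm, which is one possible choice and not necessarily the one used by the authors; the function name `indirect_path_lengths` and the example weights in the usage note are made up for illustration and are not the values of Figure 3.

```python
from itertools import combinations, product
from typing import Dict, Hashable, Iterable, Tuple


def indirect_path_lengths(
    node_weight: Dict[Hashable, float],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Return the ippl on one axis for every node pair without a direct predicate."""
    edges = list(edges)
    nodes = list(node_weight)
    inf = float("inf")
    dist = {(u, v): (0.0 if u == v else inf) for u, v in product(nodes, repeat=2)}
    # Edge weight = sum of the weights of its two end nodes.
    for u, v in edges:
        dist[u, v] = dist[v, u] = node_weight[u] + node_weight[v]
    # Floyd-Warshall; product() varies k slowest, so k is the outer loop.
    for k, i, j in product(nodes, repeat=3):
        if dist[i, k] + dist[k, j] < dist[i, j]:
            dist[i, j] = dist[i, k] + dist[k, j]
    direct = {frozenset(e) for e in edges}
    return {
        (s, d): (dist[s, d] - node_weight[s] - node_weight[d]) / 2
        for s, d in combinations(nodes, 2)
        if frozenset((s, d)) not in direct and dist[s, d] < inf
    }
```

With hypothetical weights A=5, B=20, C=50, D=8 and direct predicates AB, AC, BD, CD, the pair AD gets the length 20 (the path through B) and the pair BC gets 5 (the path through A). Running the function once with x_max weights and once with y_max weights gives the x_ippl and y_ippl.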




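(a) Query graph with node weights and edge weights for a 5-way join 

pairs   x_ipp   x_ippl   y_ipp   y_ippl 
AD      ABD     20       ACD     50 
BC      BDC     10       BDC     10 
BE      BDE     10       BDE     10 
AE      ABDE    30       ACE     50 

(b) Indirect predicate paths and lengths 

Fig. 3. Maximum weighted query graph 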



The indirect predicates can be simultaneously checked with the additional 
predicates in SortedIntersectionTest_1 of Algorithm MwayRtreeJoin_1. We call 
the algorithm doing the indirect predicate filtering MwayRtreeJoin_2. 



3) Maximum Tagged R-Trees. Until now, we have used only one max 
x-length and y-length per relation. In this case, if a relation contains a few 
extremely large objects while its other objects are not so large, the effect of indirect 
predicates can degrade considerably. One possible solution for this is to 
have the max x-length and y-length per R-tree node. A leaf node has the max 
x-length and y-length for MBRs of all entries in the node, and an intermediate 
node has the maximum value for the max x-lengths and max y-lengths of its 
child nodes. In the end, the root node has the max x-length and max y-length 
for the relation. The max x-length per R-tree node is recursively defined as in 
Expression (2). The max y-length is similarly defined. 

$$x\_max(N) = \begin{cases} \max\{x\_max(N_i) \mid N_i \text{ is a child node of } N\} & \text{if } N \text{ is an intermediate node} \\ \max\{a_{ix} \mid a_i \text{ is the MBR of an entry in } N\} & \text{if } N \text{ is a leaf node} \end{cases} \qquad (2)$$




We call the max x-length and y-length per relation the domain max information 
and those per R-tree node the node max information. By using the node max 
information instead of the domain max information, we can obtain more pruning 
in indirect predicate filtering of the M-way R-tree join. Since only two 
max values are attached per R-tree node (one for x-length and the other for 
y-length), the storage overhead due to the max lengths is negligible. And since 
the max lengths can be dynamically maintained with the R-tree insertion and 
deletion, we can always have exact max lengths per R-tree node. We call this 
R-tree the maximum tagged R-tree. 

We compute the ipp's for each axis only once, using the max information in the 
root nodes of the R-trees, because calculating the shortest paths for every 
node-tuple incurs a large CPU time overhead². However, we compute the ippl's for every 
node-tuple based on the ipp's obtained from the root nodes. We call the algorithm using 
maximum tagged R-trees MwayRtreeJoin_3. 
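
The recursive definition of the node max information can be sketched as a bottom-up pass over a tree. The `Node` class below is a hypothetical stand-in for an R-tree node, not the structure used in the paper, and a real implementation would maintain the tags during insertion and deletion instead of recomputing them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


@dataclass
class Node:
    """Hypothetical R-tree node: a leaf holds entry MBRs, an inner node holds children."""
    entries: List[Rect] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    x_max: float = 0.0
    y_max: float = 0.0


def tag_max(node: Node) -> Tuple[float, float]:
    """Fill in x_max and y_max for node and all of its descendants."""
    if node.children:
        child_max = [tag_max(child) for child in node.children]
        node.x_max = max(m[0] for m in child_max)
        node.y_max = max(m[1] for m in child_max)
    else:
        node.x_max = max((r[2] - r[0] for r in node.entries), default=0.0)
        node.y_max = max((r[3] - r[1] for r in node.entries), default=0.0)
    return node.x_max, node.y_max
```

After the pass, the root carries the domain max information, while each subtree carries the tighter bound that applies to the objects stored below it.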

4 Experiments 

To measure the performance of the M-way R-tree joins, we conducted some 
experiments using real data sets. The experiments were performed on a Sun Ultra 
II 170 MHz platform running Solaris 2.5.1 with 384 MB of main 
memory. We implemented the three M-way R-tree join algorithms: 
MwayRtreeJoin_1 (MRJ1), MwayRtreeJoin_2 (MRJ2) and MwayRtreeJoin_3 (MRJ3). For 
performance comparisons, we also implemented the multi-level forward checking 
(MFC) algorithm with the dynamic variable ordering (DVO) and the join 
window reduction (JWR) algorithm, which were proposed in previous work. Additionally, we 
implemented another MFC algorithm (MFC1) which uses our space restriction 
ordering (SRO) as well as FC-DVO to check the pure effect of SRO. 

The real data in our experiments were extracted from the TIGER/Line data 
of the US Bureau of the Census. We used the road segment data of 10 counties 
of the State of California in the TIGER data. The characteristics (statistical 
information) of the California TIGER data are summarized in Table 1. The original 
TIGER data of all counties were center-matched to join different county regions, 
i.e., the x and y coordinates of the original TIGER data were subtracted from 
those of the center point of each county. The center-matched data were divided 
by 10 for easy handling. 

We implemented the R*-tree insertion algorithm to build R*-trees for each 
county data. The node sizes of the R*-trees considered are 512, 2048 and 4096 
bytes. The tree heights for all county data for each node size are 4, 3 and 3, 
respectively. The LRU buffers are 256 pages for every node size³. 

² The complexity of computing all-pairs shortest paths is known to be $O(M^3)$. 

³ We assume that an R*-tree node occupies one page. 

Table 1. Characteristics of the California TIGER data 

county          # of obj   domain area      max length     avg length   density 
Alameda         49070      86222*44995      4662*3940      102*80       0.23 
Contra Costa    40363      88025*33808      4676*5112      100*77       0.21 
Fresno          58163      233238*151898    7190*4633      210*167      0.09 
Kern            113407     257781*100758    8204*6497      212*169      0.26 
Monterey        35417      175744*112068    9085*6194      234*192      0.20 
Orange          91970      69999*55588      3658*6735      80*66        0.21 
Riverside       91751      323725*65389     12113*10062    158*126      0.21 
Sacramento      46516      75771*71218      6442*4103      111*86       0.24 
San Diego       103420     151241*96476     8054*6828      122*104      0.22 
Santa Barbara   64037      99301*58696      4541*6460      100*81       0.22 



We selected the following 4 query types as input queries: complete, half, ring 
and chain. Example query graphs for each query type in a 5-way join are shown 
in Figure 4. The spatial predicate used for our experiments is intersect. 




Fig. 4. Example query graphs in a 5-way join 



First, we measured the total response time (CPU time + I/O time) for various 
data sets and various query types, with a fixed node size of 2048 bytes. The total 
response time was measured as “the elapsed CPU time + the number of I/Os * 
the unit I/O time.” The unit I/O time was set to 10 ms, which is a typical value 
for a random I/O. For this experiment, we extracted the following three 
data sets from the TIGER data shown in Table 1. An M-way join for each data 
set was performed for the first M counties of the data set. 



Data set 1:   Ora.   Sac.   S.B.   S.D.   Ala.   Kern   Riv. 
Data set 2:   Ora.   Ala.   Sac.   S.D.   S.B.   Kern   Mon. 
Data set 3:   C.C.   S.B.   Mon.   Ora.   Sac.   Ala.   Fre. 
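
The cost model used for the measurements can be written down directly. The sketch below is only an illustration of the formula quoted above; the function name and the CPU figure in the usage note are made up.

```python
UNIT_IO_SECONDS = 0.010  # 10 ms per random I/O, as stated in the text


def total_response_time(cpu_seconds: float, num_io: int,
                        unit_io_seconds: float = UNIT_IO_SECONDS) -> float:
    """Total response time = elapsed CPU time + number of I/Os * unit I/O time."""
    return cpu_seconds + num_io * unit_io_seconds
```

For example, a hypothetical run with 3.4 s of CPU time and 2563 page reads would be charged 3.4 + 25.63 = 29.03 s.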



The total response time is shown in Table 2. The relative rates of the total 
response time compared to Algorithm MwayRtreeJoin_1 (MRJ1) are shown in 
Figure 5 (only for the algorithms using the synchronous traversal (ST) 
technique). The numbers of solutions for each data set are also shown in Table 3. 

First, we compared the relative performances among the ST algorithms such 
as MFCs and MRJs. In most cases, SRO considerably reduces the query response 
time (compare MFC and MFC1 in Table 2 and Figure 5). FC-DVO has a better 
performance in complete and half queries while PSO has a better performance 
in the chain query. In the ring query, both have a similar performance or 
FC-DVO has a slightly better performance (compare MFC1 and MRJ1 in Table 2 
and Figure 5). The reason FC-DVO has a better performance in the complete 
and half queries is that FC-DVO prunes the entries of the future variables 
faster with many predicates while PSO does not prune the entries of the outer 
variables until they are actually evaluated. Since the chain and ring queries 
are more general in real life and more time consuming than other queries, we 
think that the optimization for these queries is more important. (According to 
Table 2, the differences of the query response time between MFC1 and MRJ1 in 
the complete and half queries are within 10 seconds, but the differences in the 
chain query reach about 1000 seconds.) 

Sometimes, in data set 3, MFC1 does not work as well as MFC. This is due 
to the locality of LRU buffers and the CPU overhead of SRO. We observed that, 
in these cases, while MFC1 accessed fewer nodes, MFC performed a smaller or 
similar number of I/Os. However, in most cases, MFC1 performed a smaller 
number of I/Os. 

Next, we measured the performance of indirect predicate filtering (IPF). In 
this measurement, we excluded the complete query type because no indirect 
predicates are in the complete query. In the half query, there is nearly no effect 
of indirect predicates (compare MRJ1 and MRJ2 in Table 2 and Figure 5). We 
do not present the effect of the maximum tagged R-tree (MRJ3) in the half query 
because it is similar to that of MRJ2 in most cases. IPF has considerable impact 
on ring and chain queries. As the number of direct predicates decreases, the effect 
of indirect predicates increases. In summary, the three optimization techniques 
(SRO, PSO and IPF) improve efficiency. The maximum improvements compared 
to MFC are about 40%, 80%, 140% and 300%, respectively, for the complete, 
half, ring and chain queries. 

Shortly after the early version of this paper, other optimization 
techniques called static variable ordering (SVO) and plane sweep and forward 
checking (PSFC) were developed. SVO orders the variables (or nodes) once 
according to the degrees before the algorithm starts. This static ordering is used 
both for the search space restriction and the forward checking. PSFC works as 
follows: The first variable is instantiated by a plane sweep, and a variant of the 
forward checking, called sorted forward checking, is used for the instantiations of 
remaining variables according to SVO. We believe that SRO is superior to SVO 
because it uses more sophisticated criteria. Actually, the experimental results in 
Table 2 and Figure 5 support our opinion. In complete and ring query graphs, the 
space restriction using SVO is the same as Algorithm SpaceRestriction_1 used 
in MFC because the degrees of all nodes are the same. Since the experimental 
results show that MFC1 outperforms MFC in most cases, SRO will be superior 
to SVO. On the other hand, when there are many direct predicates, PSFC will 
naturally outperform PSO because our experiments show that MFC outperforms 
PSO for numerous direct predicates. 

Table 2. Total response time for various data sets (node size: 2048, unit: sec) 

               |          Data set 1           |          Data set 2           |          Data set 3 
Type      Alg. |   M=3     4     5     6     7 |   M=3     4     5     6     7 |   M=3     4     5     6     7 
Complete  MFC  |    17    19    10    13    13 |     8     9    11    14    11 |    14    21    10    11    15 
          MFC1 |    14    14     8    11    11 |     6     7     8    11     9 |    13    22     9    10    14 
          MRJ1 |    15    17     8    11    12 |     6     8     9    12     9 |    13    23     9    10    14 
          JWR  |    36    57    24    25    24 |    16    22    24    25    17 |    41    72    39    41    43 
Half      MFC  |     -    22    36    22    28 |     -    11    21    24    21 |     -    34    14    26    41 
          MFC1 |     -    19    20    18    21 |     -    10    14    19    16 |     -    36    13    18    30 
          MRJ1 |     -    21    20    20    22 |     -    11    13    19    17 |     -    46    14    17    33 
          MRJ2 |     -    21    20    20    23 |     -    11    13    20    17 |     -    47    15    17    33 
          JWR  |     -    78    33    56   285 |     -    74    29    29    72 |     -   223   194    58   223 
Ring      MFC  |     -    25    26   106   710 |     -    12    26    83   295 |     -    34    45   122   649 
          MFC1 |     -    20    21    81   461 |     -    10    21    65   213 |     -    36    34    99   500 
          MRJ1 |     -    22    21    77   470 |     -    11    22    68   196 |     -    37    36   119   623 
          MRJ2 |     -    22    20    69   364 |     -    11    22    63   176 |     -    38    35   111   397 
          MRJ3 |     -    21    19    63   301 |     -    11    22    60   152 |     -    40    34   101   333 
          JWR  |     -   198   159  1048   228 |     -    41    97   172   217 |     -   172   171   209   183 
Chain     MFC  |    26    81   335  1469  3805 |    11    32   166   939  2105 |    21   251   191  1738  7851 
          MFC1 |    24    69   244   969  2363 |     9    26   114   632  1428 |    19   223   162  1256  5396 
          MRJ1 |    23    62   181   818  2026 |     8    22    92   538  1198 |    18   184   123   896  4324 
          MRJ2 |    23    57   147   580  1341 |     8    20    76   396   859 |    18   196   127   779  3218 
          MRJ3 |    23    55   131   454   954 |     8    20    67   342   719 |    18   201   126   655  2815 
          JWR  |    40   212   455  1091  1557 |    17    44   106   185   426 |    58   277  2681   841  1276 

[Plot: one panel per query type and data set (data sets 1 to 3); legend: MFC, MFC1, MRJ1] 

Fig. 5. Rates of total response time for various data sets (node size: 2048) 

Table 3. Number of solutions for various data sets 

Data set    Type      |       M=3          4          5          6          7 
Data Set 1  Complete  |    16,156      3,893        435        192         31 
            Half      |         -     18,897     25,578        881        128 
            Ring      |         -     11,590      7,098     12,298      4,220 
            Chain     |   131,759    329,855    440,945    475,497     81,419 
Data Set 2  Complete  |     4,733      1,719        435        192         77 
            Half      |         -     15,327      2,371        825        318 
            Ring      |         -      5,209      5,916      8,724     19,574 
            Chain     |    23,155     61,446     56,295    254,505    177,627 
Data Set 3  Complete  |    23,188     21,880      2,506        725        152 
            Half      |         -    232,068      6,152      4,558      2,074 
            Ring      |         -    161,611     72,346     51,600     42,238 
            Chain     |   102,327  2,753,856    530,673  1,271,835  3,441,939 

Next, we compared the query response time between ST algorithms and 
JWR. As the variable instantiation order of JWR, we used the same order as in PSO. 
According to the results shown in Table 2, ST algorithms perform better 
for all values of M in the complete and half queries and for most values of M in 
the other queries. When M is high (6 or 7), JWR performs better than MFC for some 
data sets in the ring and chain queries, which is similar to previously reported 
results. There are some cases in which JWR performs better than MFC for some data 
sets, but worse than the MRJs. For example, see Table 2 for 
M=6,7 and data set 1, M=5 and data set 2, and M=6 and data set 3 in the 
chain query. Therefore, unlike the previously reported experimental results, we can use our 
M-way R-tree join algorithms for a higher range of M. 

Sometimes, the costs of JWR increase abruptly (for example, M=7 in 
the half query of data set 1, M=6 in the ring query of data set 1, and M=5 
in the chain query of data set 3). We think this is due to the evaluation order 
of variables. While real data sets are highly skewed, PSO does not consider the 
data distribution. However, the variable ordering worked properly in most other 
cases. 

Next, we conducted an experiment for various node sizes. Table 4 shows the 
total response time of all algorithms for various node sizes and a fixed data set 
2. Figure 6 illustrates the performance rates of the total response time compared 
to MRJ1. According to Figure 6, SRO has large effects in most cases. And the 
smaller the node size is, the better the performance of FC-DVO is. In other 
words, the larger the node size, the better the performance of PSO. (See the 
performance rate of MFC1 compared to MRJ1.) In particular, PSO has a better 
performance than FC-DVO for node size 4096 of the ring query although both 
have a similar performance for node size 2048. As the node size increases, the 
effect of IPF slightly decreases in ring and chain query types. When the node 
size is 4096, there is nearly no difference between the effect of indirect predicates 
using domain max information (MRJ2) and that using node max information 
(MRJ3). This is because the large node size leads to many entries per node and 
increases the node max information. However, even for a large node size (4096), 
the effect of IPF is large for the ring and chain queries. 

Table 4. Total response time for various node sizes (data set 2, unit: sec) 

               |              512              |             2048              |             4096 
Type      Alg. |   M=3     4     5     6     7 |   M=3     4     5     6     7 |   M=3     4     5     6     7 
Complete  MFC  |    29    32    34    47    31 |     8     9    11    14    11 |     5     6     7    10     9 
          MFC1 |    20    23    24    36    26 |     6     7     8    11     9 |     4     5     6     9     8 
          MRJ1 |    20    24    28    40    28 |     6     8     9    12     9 |     4     5     6     9     8 
          JWR  |    39    49    47    50    34 |    16    22    24    25    17 |    14    20    22    23    16 
Half      MFC  |     -    38    77    77    65 |     -    11    21    24    21 |     -     8    17    20    21 
          MFC1 |     -    32    50    63    54 |     -    10    14    19    16 |     -     7    11    15    15 
          MRJ1 |     -    34    54    65    56 |     -    11    13    19    17 |     -     8    11    14    14 
          MRJ2 |     -    33    51    68    59 |     -    11    13    20    17 |     -     7    11    15    15 
          JWR  |     -   148    57    60   121 |     -    74    29    29    72 |     -    69    26    29    71 
Ring      MFC  |     -    42    87   281   791 |     -    12    26    83   295 |     -     9    21    78   323 
          MFC1 |     -    32    66   236   622 |     -    10    21    65   213 |     -     8    17    62   222 
          MRJ1 |     -    34    73   243   629 |     -    11    22    68   196 |     -     7    14    51   189 
          MRJ2 |     -    34    70   222   537 |     -    11    22    63   176 |     -     7    14    47   155 
          MRJ3 |     -    34    63   186   423 |     -    11    22    60   152 |     -     7    14    47   149 
          JWR  |     -    81   172   273   371 |     -    41    97   172   217 |     -    43   108   200   246 
Chain     MFC  |    41    93   472  3521  6692 |    11    32   166   939  2105 |     8    27   147   774  1816 
          MFC1 |    30    66   322  2383  4702 |     9    26   114   632  1428 |     7    23   107   528  1177 
          MRJ1 |    30    69   303  2309  4647 |     8    22    92   538  1198 |     5    14    62   362   876 
          MRJ2 |    29    65   252  1687  3467 |     8    20    76   396   859 |     5    13    48   259   600 
          MRJ3 |    28    59   180  1027  2064 |     8    20    67   342   719 |     5    13    48   263   592 
          JWR  |    40    85   168   292   604 |    17    44   106   185   426 |    15    45   113   210   446 

[Plot: one panel per query type and node size (512, 2048, 4096); legend: MFC, MFC1] 

Fig. 6. Rates of total response time for various node sizes (data set 2) 

The comparison between ST algorithms and JWR shows that ST algorithms 
perform better for large node sizes. This is due to the index probing overhead in 
JWR. Since there is no global ordering in multi-dimensional non-point objects, 
we should check all entries of a node during an R*-tree search. In addition, while 
ST algorithms have the best performance for all query types in node size 4096 
compared to other node sizes, JWR has the best performance for the ring and 
chain queries in node size 2048. 

Finally, we measured the I/O time (see Table 5 and Table 6). MRJs consume 
more I/O time than MFCs and JWR for high values of M. From Table 5 and Table 6, 
however, we found an important fact: the higher the value of M, the lower the 
rate of I/O time compared to the total response time. For ring and chain queries, 
the rate of I/O time considerably decreases as M increases. Therefore, the I/O 
time becomes less important and the M-way R-tree join becomes CPU-bound. 
The I/O rate also decreases as the node size increases. (According to Table 6, when 
the node size is 4096 and M is 7, the I/O rates in the ring and chain query types 
are less than or equal to 5%.) 
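
The trend can be checked with a short calculation. The sketch below recomputes the I/O share for the MFC algorithm on the chain query from the I/O counts and total response times reported for data set 2, assuming that the surviving I/O counts belong to the 512-byte node size and using the 10 ms unit I/O time. Because the published totals are rounded to whole seconds, the recomputed percentages can differ from the tabulated ones by about a point.

```python
UNIT_IO_SECONDS = 0.010  # unit I/O time stated in the text


def io_share_percent(num_io: int, total_seconds: float) -> float:
    """Share of the total response time that is spent on I/O, in percent."""
    return 100.0 * num_io * UNIT_IO_SECONDS / total_seconds


# MFC, chain query, data set 2, M = 3..7 (values read from the tables;
# the assignment to node size 512 is an assumption)
io_counts = [3031, 4740, 7858, 18627, 29918]
total_seconds = [41, 93, 472, 3521, 6692]

shares = [io_share_percent(n, t) for n, t in zip(io_counts, total_seconds)]
for m, share in zip(range(3, 8), shares):
    print(f"M={m}: I/O share about {share:.0f}%")
```

The share falls steadily as M grows, which is the CPU-bound behaviour described above.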



Table 5. I/O time for data set 2 (node size: 2048) 

In overall summary, we recommend the following based on the experimental 
results: First, always use SRO. Second, if there are many direct predicates as in 
the complete and half queries, use FC-DVO and no IPF. Third, if the number 
of direct predicates is small as in the ring and chain queries, use PSO and IPF. 
Fourth, if the node size is small and M is high, use JWR; otherwise, use ST 
algorithms. 

Table 6. I/O time for various node sizes (data set 2) 

[Only one column group survives; the I/O counts match the 512-byte totals of Table 4. The columns for node size 4096 are not preserved.] 

               |          # of I/O (512)            |         I/O rate (%) 
Type      Alg. |    M=3      4      5      6      7 |   M=3     4     5     6     7 
Complete  MFC  |   2563   2679   2655   3396   1796 |    87    82    78    72    59 
          MFC1 |   1658   1782   1733   2372   1402 |    84    79    73    65    53 
          MRJ1 |   1653   1902   2091   2687   1536 |    84    80    75    67    55 
          JWR  |   2382   2591   2287   2449   1629 |    61    53    48    49    48 
Half      MFC  |      -   2931   3414   4216   2661 |     -    76    45    55    41 
          MFC1 |      -   2309   2285   3429   2287 |     -    72    46    55    42 
          MRJ1 |      -   2474   2427   3710   2564 |     -    73    45    57    46 
          MRJ2 |      -   2451   2427   3953   2701 |     -    74    48    58    46 
          JWR  |      -   5702   2342   2887   4109 |     -    39    41    48    34 
Ring      MFC  |      -   3106   4767   7918   8860 |     -    74    55    28    11 
          MFC1 |      -   2326   3528   7318   7644 |     -    72    53    31    12 
          MRJ1 |      -   2496   4332   8256   8258 |     -    74    59    34    13 
          MRJ2 |      -   2498   4032   7954   8039 |     -    74    57    36    15 
          MRJ3 |      -   2489   3773   7283   7229 |     -    73    60    39    17 
          JWR  |      -   2673   2943   3214   3095 |     -    33    17    12     8 
Chain     MFC  |   3031   4740   7858  18627  29918 |    73    51    17     5     4 
          MFC1 |   2239   3249   6149  16064  23149 |    74    49    19     7     5 
          MRJ1 |   2269   3732   5855  20176  34121 |    75    54    19     9     7 
          MRJ2 |   2255   3539   5659  17228  28152 |    77    55    22    10     8 
          MRJ3 |   2250   3388   5384  14865  21579 |    79    58    30    14    10 
          JWR  |   2378   2690   2900   3184   3334 |    59    32    17     1     6 
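
The four recommendations of the overall summary can be read as a small decision procedure. The sketch below is one possible reading, not a rule set given by the authors: the text does not state thresholds for "many direct predicates", "small node size" or "high M", so these are passed in as booleans, and the function name and the returned dictionary keys are hypothetical.

```python
from typing import Dict, Union


def recommend(many_direct_predicates: bool, small_node_size: bool,
              high_m: bool) -> Dict[str, Union[str, bool]]:
    """Map the summary's recommendations to a configuration."""
    if small_node_size and high_m:
        # Fourth recommendation: prefer join window reduction in this case.
        return {"family": "JWR"}
    config: Dict[str, Union[str, bool]] = {
        "family": "ST",
        "space_restriction": "SRO",  # first recommendation: always use SRO
    }
    if many_direct_predicates:
        # Second recommendation (e.g. complete and half queries).
        config.update(ordering="FC-DVO", ipf=False)
    else:
        # Third recommendation (e.g. ring and chain queries).
        config.update(ordering="PSO", ipf=True)
    return config
```

A chain query with 2048-byte nodes would, under this reading, map to `recommend(False, False, True)`, that is, an ST algorithm with SRO, PSO and IPF.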

5 Conclusions 

In this paper, we studied the generalization of the 2-way R-tree join to M-way 
joins. We proposed the following three optimization techniques: space restriction 
ordering (SRO), plane sweep ordering (PSO) and indirect predicate filtering (IPF). 
Through experiments using real data, we showed that our three optimization 
techniques greatly improve the performance of synchronous traversal 
(ST) algorithms. 

After completing the M-way R-tree join, an oid pair may appear several times 
in the resulting oid-tuples. If the oid-tuples are read in the combined refinement 
step without scheduling, the same page may be accessed several times and 
the same refinement operation may be performed several times. However, this can be solved by 
extending scheduling methods for oid pairs, such as [23], to oid-tuples. In future 
studies, first, we will develop an efficient combined refinement algorithm for the 
M-way spatial join. Second, although we found that the I/O rate of the total 
response time decreases as M increases, the I/O rate is still high for a small M. 
Therefore, we will develop I/O optimization techniques for the M-way R-tree 
join. Last, we will combine the optimization techniques proposed in this paper 
with our rule-based optimization technique for spatial and non-spatial mixed 
queries called ESFAR (Early Separated Filter And Refinement). 

Abstract. The family of R-trees is suitable for storing various kinds 
of multidimensional objects and is considered an excellent choice for 
indexing a spatial database. Region Quadtrees are suitable for storing 
2-dimensional regional data and their linear variant is used in many 
Geographical Information Systems for this purpose. In this report, we 
present five algorithms suitable for processing join queries between these 
two successful, although very different, access methods. Two of the al- 
gorithms are based on heuristics that aim at minimizing I/O cost with 
a limited amount of main memory. We also present the results of exper- 
iments performed with real data that compare the I/O performance of 
these algorithms. 

Index terms: Spatial databases, access methods, R-trees, linear quad- 
trees, query processing, joins. 



1 Introduction 

Several spatial access methods have been proposed in the literature for storing 
multi-dimensional objects (e.g. points, line segments, areas, volumes, and hyper- 
volumes). These methods are classified in one of the following two categories 
according to the principle guiding the hierarchical decomposition of data regions 
in each method. 

— Data space hierarchy: a region containing data is split (when, for example, a 
maximum capacity is exceeded) to sub-regions which depend on these data 
only (for example, each of two sub-regions contains half of the data) 

— Embedding space hierarchy: a region containing data is split (when a certain 
criterion holds) to sub-regions in a regular fashion (for example, a square 
region is always split in four quadrant sub-regions) 
The book by Samet and the recent survey by Gaede and Guenther [2] provide 
excellent information sources for the interested reader. 

A representative of the first principle that has gained significant appreciation 
in the scientific and industrial community is the R-tree. There are a number of 
variations of this structure all of which organize multidimensional data objects 
by making use of the Minimum Bounding Rectangles (MBRs) of the objects. 
This is an expression of the “conservative approximation principle” . This family 
of structures is considered an excellent choice for indexing various kinds of data 
(like points, polygons, 2-d objects, etc) in spatial databases and Geographical 
Information Systems. 

A famous representative of the second principle is the Region Quadtree. This 
structure is suitable for storing and manipulating 2-dimensional regional data 
(or binary images). Moreover, many algorithms have been developed based on 
Quadtrees. The most widely known secondary memory alternative of this 
structure is the Linear Region Quadtree. Linear Quadtrees have been used 
for organizing regional data in Geographical Information Systems. 

These totally different families of popular data structures can co-exist in a 
Spatial Information System. For example, in his tutorial in the same conference, 
Sharma refers to spatial and multimedia extensions to the Oracle 8i server 
that are based on the implementation of a linear quadtree and a modified 
R*-tree. Each of these structures can be used for answering a number of very useful 
queries. However, the processing of queries that are based on both structures 
has not been studied in the literature. In this report, we present a number of 
algorithms that can be used for processing joins between the two structures. 

For example, the R-tree data might be polygonal objects that represent swimming 
and sun-bathing sites and the Quadtree data a map, where black color 
represents a decrease of ozone and white color represents ozone-safe areas. A user 
may ask which sites suffer from the ozone problem. The major problem in answering 
such a query is to make use of the space hierarchy properties of each of 
the structures, so that irrelevant data are not transferred to main memory and 
the same data are not transferred many times. Three of the proposed algorithms 
are simple and suffer from such unnecessary transfers when the buffering space 
provided is limited. We also propose two more sophisticated algorithms that deal 
with this problem by making use of heuristics and achieve good performance 
with a limited amount of main memory. 

The organization of the paper is as follows. In Section 2, we present in brief 
the families of R-trees and Linear Region Quadtrees. In Section 3, we review 
join processing in spatial databases. In Section 4, we present the algorithms that 
process R-Quad Joins. More specifically, in Subsections 4.1 to 4.3 we present 
the three simple algorithms, in Subsection 4.4 our heuristics and the buffering 
scheme used and in Subsections 4.5 and 4.6 the two sophisticated algorithms. In 
Section 5, we present our experimental setting and some results of experiments 
we performed with real data. These experiments compare the I/O performance 
of the different algorithms. In Section 6, we summarize the contribution of this 
work and discuss issues that require further research in the future. 
2 The Two Structures 

2.1 R-Trees 

R-trees are hierarchical data structures based on B+-trees. They are used for the 
dynamic organization of a set of k-dimensional geometric objects representing 
them by the minimum bounding k-dimensional rectangles (in this paper we focus 
on 2 dimensions) . Each node of the R-tree corresponds to the minimum rectangle 
that bounds its children. The leaves of the tree contain pointers to the objects of 
the database, instead of pointers to children nodes. The nodes are implemented 
as disk pages. 

It must be noted that the rectangles that surround different nodes may be 
overlapping. Besides, a rectangle can be included (in the geometrical sense) in 
many nodes, but can be associated to only one of them. This means that a spatial 
search may demand visiting of many nodes, before confirming the existence or 
not of a given rectangle. 

The rules obeyed by the R-tree are as follows. Leaves reside on the same 
level. Each leaf contains pairs of the form (R, O), such that R is the minimum 
rectangle that spatially contains object O. Every other node contains pairs of the 
form (R, P), where P is a pointer to a child of the node and R is the minimum 
rectangle that spatially contains the rectangles contained in this child. An R-tree 
of class (m, M) has the characteristic that every node, except possibly for the 
root, contains between m and M pairs, where m ≤ ⌈M/2⌉. The root contains at 
least two pairs, if it is not a leaf. Figure 1 depicts some rectangles on the right 
and the corresponding R-tree on the left. Dotted lines denote the bounding 
rectangles of the subtrees that are rooted in inner nodes. 
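
To make the (R, O) and (R, P) pairs concrete, the following is a minimal Python sketch of an R-tree node and a window query over it. The names `Node`, `overlaps` and `window_query`, and the tuple layout of a rectangle, are illustrative assumptions that do not come from the paper; insertion, node splitting and disk paging are left out.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

# Rectangle as (x_min, y_min, x_max, y_max); an assumed representation.
Rect = Tuple[float, float, float, float]


@dataclass
class Node:
    """Hypothetical in-memory R-tree node (the paper keeps nodes in disk pages)."""
    is_leaf: bool
    # Leaf entries are (R, O) pairs; internal entries are (R, child Node) pairs.
    entries: List[Tuple[Rect, Any]] = field(default_factory=list)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def window_query(node: Node, window: Rect) -> List[Any]:
    """Collect every object whose MBR overlaps the query window."""
    hits: List[Any] = []
    for rect, item in node.entries:
        if not overlaps(rect, window):
            continue  # prune: nothing below this entry can overlap the window
        if node.is_leaf:
            hits.append(item)
        else:
            hits.extend(window_query(item, window))
    return hits


# Two leaves whose bounding rectangles overlap, as permitted in an R-tree.
leaf1 = Node(True, [((0, 0, 2, 2), "a"), ((1, 1, 4, 3), "b")])
leaf2 = Node(True, [((3, 2, 6, 5), "c")])
root = Node(False, [((0, 0, 4, 3), leaf1), ((3, 2, 6, 5), leaf2)])
```

Because the two root entries overlap, a query window that falls inside the overlap makes the search descend into both leaves, which is the multiple-path behavior noted above.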
Fig. 1. An example of an R-tree 



Many variations of R-trees have appeared. The most important of these are 
packed R-trees, R+-trees and R*-trees. The R*-tree does not have the 
limitation for the number of pairs of each node and follows a node split technique 
that is more sophisticated than that of the simple R-tree. It is considered the 
most efficient variant of the R-tree family and, as far as searches are concerned, 
it can be used in exactly the same way as simple R-trees. This paper refers to 
simple R-trees or to R*-trees. 



2.2 Region Quadtrees 

The Region Quadtree is the most popular member in the family of quadtree- 
based access methods. It is used for the representation of binary images, that 
is 2^n × 2^n binary arrays (for a positive integer n), where a 1 (0) entry stands 
for a black (white) picture element. More precisely, it is a degree four tree with 
height n, at most. Each node corresponds to a square array of pixels (the root 
corresponds to the whole image). If all of them have the same color (black or 
white) the node is a leaf of that color. Otherwise, the node is colored gray and 
has four children. Each of these children corresponds to one of the four square 
sub-arrays to which the array of that node is partitioned. We assume here, that 
the first (leftmost) child corresponds to the NW sub-array, the second to the 
NE sub-array, the third to the SW sub-array and the fourth (rightmost) child to 
the SE sub-array. For more details regarding Quadtrees, see the sources cited in 
Section 1. Figure 2 shows an 8 × 8 pixel array and the corresponding Quadtree. 
Note that black (white) 
squares represent black (white) leaves, while circles represent gray nodes. 
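
The recursive definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image is a list of rows with row 0 at the north edge, directions are numbered 0 = NW, 1 = NE, 2 = SW, 3 = SE in the child order given above, and the function name `black_blocks` is hypothetical. Only black leaves are collected, since these are what the linear representation described next stores.

```python
from typing import List, Tuple

# A black leaf is reported as (path, level): `path` lists the directions taken
# from the root (0=NW, 1=NE, 2=SW, 3=SE) and `level` is the height of the leaf
# above single pixels (level 0 = one pixel, level n = the whole image).
Block = Tuple[Tuple[int, ...], int]


def black_blocks(image: List[List[int]]) -> List[Block]:
    """Return the black leaves of the Region Quadtree of a 2^n x 2^n image."""
    size = len(image)
    assert size > 0 and size & (size - 1) == 0, "side must be a power of two"
    assert all(len(row) == size for row in image), "image must be square"
    n = size.bit_length() - 1
    out: List[Block] = []

    def visit(row: int, col: int, side: int, path: Tuple[int, ...]) -> None:
        pixels = [image[r][c]
                  for r in range(row, row + side)
                  for c in range(col, col + side)]
        if all(p == 1 for p in pixels):        # uniform black: a black leaf
            out.append((path, n - len(path)))
        elif any(p == 1 for p in pixels):      # mixed colors: a gray node
            half = side // 2
            visit(row, col, half, path + (0,))                # NW
            visit(row, col + half, half, path + (1,))         # NE
            visit(row + half, col, half, path + (2,))         # SW
            visit(row + half, col + half, half, path + (3,))  # SE
        # uniform white: a white leaf, nothing to record

    visit(0, 0, size, ())
    return out


example = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
blocks = black_blocks(example)
```

For the 4 × 4 example, the NW quadrant is one black leaf at level 1 and the single black pixel is a leaf at level 0 reached through SE and then NE. The sketch rescans pixels at every level for brevity; a bottom-up merge would avoid that.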




Fig. 2. An image and the corresponding Region Quadtree 



Region Quadtrees, as presented above, can be implemented as main mem- 
ory tree structures (each node being represented as a record that points to its 
children). Variations of Region Quadtrees have been developed for secondary 
memory. Linear Region Quadtrees are the ones used most extensively. A Linear 
Quadtree representation consists of a list of values, where there is one value for 
each black node of the pointer-based Quadtree. The value of a node is an address 
describing the position and size of the corresponding block in the image. These 
addresses can be stored in an efficient structure for secondary memory (such as 
a B-tree or one of its variations) . There are also variations of this representation 
where white nodes are stored too, or variations which are suitable for multicolor 
images. Evidently, this representation is very space efficient, although it is not 
suited to many useful algorithms that are designed for pointer-based Quadtrees. 
The most popular linear implementations are the FL (Fixed Length), the FD 
(Fixed length - Depth) and the VL (Variable length) linear implementations. 

In the FL implementation, the address of a black Quadtree node is a code-word 
that consists of n base-5 digits. Codes 0, 1, 2 and 3 denote directions NW, 
NE, SW and SE, respectively, while code 4 denotes a do-not-care direction. If 
the black node resides on level i, where n ≥ i ≥ 0, then the first n − i digits 
express the directions that constitute the path from the root to this node and 
the last i digits are all equal to 4. In the FD implementation, the address of a 
black Quadtree node has two parts: the first part is a code-word that consists 
of n base-4 digits. Codes 0, 1, 2 and 3 denote directions NW, NE, SW and SE, 
respectively. This code-word is formed in a similar way to the code-word of the 
FL-linear implementation with the difference that the last i digits are all equal 
to 0. The second part of the address has ⌈log₂(n + 1)⌉ bits and denotes the depth 
of the black node, or in other words, the number of digits of the first part that 
express the path to this node. In the VL implementation the address of a black 
Quadtree node is a code-word that consists of at most n base-5 digits. Code 0 is 
not used in addresses, while codes 1, 2, 3 and 4 denote one of the four directions 
each. If the black node resides on level i, where n ≥ i ≥ 0, then its address 
consists of n − i digits expressing the directions that constitute the path from 
the root to this node. The depth of a node can be calculated by finding the 
smallest value equal to a power of 5 that gives 0 quotient when the address of 
this node is divided (using integer division) by this value. 
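
The three encodings can be compared side by side with a short Python sketch. A block is given by its path from the root (directions 0 = NW, 1 = NE, 2 = SW, 3 = SE) in a Quadtree of height n. The function names are hypothetical, and the VL mapping of directions to digits 1 to 4 (direction + 1) is an assumption, because the text does not say which digit denotes which direction.

```python
from typing import Sequence, Tuple


def fl_code(path: Sequence[int], n: int) -> str:
    """FL: n base-5 digits, the path followed by 4 (do-not-care) digits."""
    return "".join(str(d) for d in path) + "4" * (n - len(path))


def fd_code(path: Sequence[int], n: int) -> Tuple[str, int]:
    """FD: n base-4 directional digits padded with 0, plus the depth."""
    return "".join(str(d) for d in path) + "0" * (n - len(path)), len(path)


def fd_bits(n: int) -> int:
    """Bits per FD address: 2 per directional digit plus ceil(log2(n + 1))."""
    # For n >= 0, n.bit_length() equals ceil(log2(n + 1)) without using floats.
    return 2 * n + n.bit_length()


def vl_code(path: Sequence[int]) -> int:
    """VL: a base-5 number with one digit in 1..4 per step (assumed mapping)."""
    value = 0
    for d in path:
        value = value * 5 + (d + 1)
    return value


def vl_depth(address: int) -> int:
    """Depth = exponent of the smallest power of 5 giving a zero quotient."""
    depth = 0
    while address // 5 ** depth != 0:
        depth += 1
    return depth
```

For a block reached by the path NW, NW in an 8 × 8 image (n = 3), `fl_code` gives "004" and `fd_code` gives ("000", 2). `fd_bits(11)` evaluates to 26, the address size for a 2048 × 2048 image.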

In the rest of this paper we assume that Linear Quadtrees are represented 
with FD-codes stored in a B+-tree (this choice is popular in many applications). 
The choice of FD linear representation, instead of the other two linear repre- 
sentations, is not accidental. The FD linear representation is made of base-4 
digits and is thus easily handled using two bits for each digit. Besides, the sorted 
sequence of FD linear codes is a depth-first traversal of the tree. Since internal 
and white nodes are omitted, sibling black nodes are stored consecutively in the 
B+-tree or, in general, nodes that are close in space are likely to be stored in the 
same or consecutive B+-tree leaves. This property helps reduce the I/O cost 
of join processing. Since two black nodes that are ancestor and descendant 
cannot co-exist in the same quadtree, two FD linear codes that coincide at all the 
directional digits cannot exist either. This means that the directional part of 
the FD-codes is sufficient for building B+-trees at all the levels. At the leaf level, 
the depth of each black node should also be stored so that images are accurately 
represented. Figure 2 shows the directional code of each black node of 
the depicted tree. 

3 Spatial Join Processing 

In Spatial Databases and Geographical Information Systems there exists the 
need for processing a significant number of different spatial queries. For example, 
such queries are: nearest neighbor finding, similarity queries, window queries, 
content based queries, or spatial joins of various kinds. A spatial join consists 
of testing every possible pair of data elements belonging to two spatial data sets 
against a spatial predicate. This predicate might be overlap, distance within, 
contain, intersect, etc. In this paper we mainly focus on the intersection spatial 
join (the most widely used join type), or on spatial joins which are processed in 
the same way as the intersection join. 

Various methods have been developed for processing spatial joins: methods 
for spatial data using approximate geometry, methods for two R-trees, methods 
for PMR quadtrees, seeded trees when one or both of the data sets lack a 
spatial index, spatial hashing, and sort-merge join. 

In this paper, we make the assumption that our spatial information system 
keeps non-regional data in R-trees or R*-trees and regional data in Linear Region 
Quadtrees, while users pose queries that involve both these two kinds of data. 
For example, the non-regional data might be cities and the regional data a map 
where black represents heavy clouds and white rather sunny areas. The user is 
very likely to ask which cities are covered with clouds. 

Most spatial join processing methods are performed in two steps. The first 
step, which is called filtering, chooses pairs of data that are likely to satisfy 
the join predicate. The second step, which is called refinement, examines the 
predicate satisfaction for all these pairs of data. The algorithms presented in 
this paper, focus on the function of the filtering step and show how a number 
of pairs of the form (Quadtree block, MBR of object) can be produced (the two 
members of each pair produced intersect). 
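
As a point of reference, the output of the filtering step can be produced by a brute-force nested loop, sketched below in Python. This is not one of the five algorithms of the paper; it only fixes what a correct result looks like. Rectangles are assumed to be (x_min, y_min, x_max, y_max) tuples of inclusive pixel coordinates, and the names are illustrative.

```python
from typing import Iterable, List, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates (assumed layout).
Rect = Tuple[int, int, int, int]


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two pixel rectangles share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_step(blocks: Iterable[Rect], mbrs: Iterable[Rect]) -> List[Tuple[Rect, Rect]]:
    """Return every (Quadtree block, MBR) pair whose members intersect."""
    mbr_list = list(mbrs)
    return [(q, m) for q in blocks for m in mbr_list if intersects(q, m)]


pairs = filter_step(
    blocks=[(0, 0, 1, 1), (4, 4, 7, 7)],
    mbrs=[(1, 1, 2, 2), (3, 3, 3, 3)],
)
```

The nested loop tests every pair and ignores both indexes, so it is useful only as a correctness baseline for the index-based algorithms that follow.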



4 Join Algorithms 

Before join processing, the correspondence of the spaces covered by the two 
structures must be established. A level-n Quadtree covers a quadrangle with 
2^n × 2^n pixels, while an R-tree covers a rectangle that equals the MBR of its root. 
Either by asking the user for input, or by normalizing the larger side of the 
R-tree rectangle with respect to 2^n, the correspondence of spaces may be determined. 
After this action, the coordinates used in the R-tree are always transformed to 
Quadtree pixel locations. 
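
One possible form of this normalization is sketched below in Python. The paper does not give the exact transformation, so the details are assumptions: the minimum corner of the root MBR is mapped to pixel (0, 0), both axes use the same scale so that the larger side spans exactly 2^n pixels, and points on the far edge are clamped to the last pixel. The name `to_pixel` is hypothetical.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_pixel(x: float, y: float, root_mbr: Rect, n: int) -> Tuple[int, int]:
    """Map an R-tree coordinate to a (column, row) pixel of a 2^n x 2^n image."""
    x_min, y_min, x_max, y_max = root_mbr
    larger_side = max(x_max - x_min, y_max - y_min)
    if larger_side <= 0:
        raise ValueError("root MBR must have a positive extent")
    size = 2 ** n
    # The larger side is stretched to exactly 2^n pixels; the other side uses
    # the same scale, so shapes are not distorted.
    col = int((x - x_min) * size / larger_side)
    row = int((y - y_min) * size / larger_side)
    # Points on the far edge would land on pixel 2^n; clamp them (assumed policy).
    return min(col, size - 1), min(row, size - 1)
```

With a root MBR of (0, 0, 64, 32) and n = 3, the point (32, 16) maps to pixel (4, 2) and the far corner (64, 32) maps to (7, 4), so the shorter side occupies only part of the image.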

Joining the two structures can be done in very simple ways, if we 
ignore that both structures are kept in disk pages as multiway trees. These 
ways fall in two categories: either we scan the entries of the B+-tree and perform 
window queries in the R-tree, or we scan the entries of the R-tree and perform 
window queries in the B+-tree. More specifically, we designed and implemented 
the following three simple algorithms. 



4.1 B+ to R Join 

— Descend the B+-tree from the root to its leftmost leaf. 

— Access sequentially (in increasing order) the FDs present in this leaf and for 
each FD perform a range search in the R-tree (reporting intersections of this 
FD and MBRs of leaves) . 
— By making use of the horizontal inter-leaf pointer, access the next B+-tree 
leaf and repeat the previous step. 

This algorithm may access a number of FDs (and the leaves in which they reside) 
that do not intersect with any data elements stored in the R-tree. Moreover, this 
algorithm is very likely to access a number of R-tree nodes several times. 



4.2 R to B+ Join with Sequential FD Access 

— Traverse recursively the R-tree, accessing the MBRs in each node in order 
of appearance within the node. 

— For the MBR of each leaf accessed, search in the B+-tree for the FD of the 
NW corner of this MBR, or one of its ancestors. 

— Access sequentially (in increasing order) the FDs of the B+-tree until the FD 
of the SE corner of this MBR, or one of its ancestors is reached (reporting 
intersections of FDs and this MBR). 

This algorithm may perform unnecessary accesses in both trees, while multiple 
accesses to B+-tree leaves are very likely. The unnecessary accesses in the 
B+-tree result from the sequential access of FDs. The following algorithm is a 
variation that deals with B+-tree accessing differently. 



4.3 R to B+ Join with Maximal Block Decomposition 

— Traverse recursively the R-tree, accessing the MBRs in each node in order 
of appearance within the node. 

— For the MBR of each leaf accessed, decompose this MBR in maximal quad- 
tree blocks. 

— For each quadblock, search in the B+-tree for the FD of the NW corner of 
this quadblock, or one of its ancestors. 

— Access sequentially (in increasing order) the FDs of the B+-tree until the 
FD of the SE corner of this quadblock, or one of its ancestors is reached 
(reporting intersections of FDs and the respective MBR). 

Although this algorithm avoids many unnecessary FD accesses, each search for a 
quadblock descends the B+-tree. Moreover, the same intersection may be reported 
more than once. To eliminate duplicate results, a temporary list of intersections 
for the current leaf is maintained. 
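
The decomposition of an MBR into maximal quadtree blocks can be sketched as a top-down recursion. The Python version below is an illustration under assumptions: the MBR has already been converted to an inclusive pixel window (x_min, y_min, x_max, y_max), y grows southward, and each block is reported as (x, y, side) with (x, y) its NW pixel. The name `maximal_blocks` is hypothetical.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]   # inclusive pixel window (assumed layout)
QuadBlock = Tuple[int, int, int]     # (x of NW pixel, y of NW pixel, side)


def maximal_blocks(window: Window, n: int) -> List[QuadBlock]:
    """Split a pixel window of a 2^n x 2^n image into maximal quadtree blocks."""
    x1, y1, x2, y2 = window
    out: List[QuadBlock] = []

    def visit(x: int, y: int, side: int) -> None:
        bx2, by2 = x + side - 1, y + side - 1
        if x > x2 or bx2 < x1 or y > y2 or by2 < y1:
            return                      # block lies entirely outside the window
        if x1 <= x and bx2 <= x2 and y1 <= y and by2 <= y2:
            out.append((x, y, side))    # block lies entirely inside: maximal
            return
        half = side // 2                # partial overlap: examine the quadrants
        visit(x, y, half)                   # NW
        visit(x + half, y, half)            # NE
        visit(x, y + half, half)            # SW
        visit(x + half, y + half, half)     # SE

    visit(0, 0, 2 ** n)
    return out
```

For the window (2, 2, 5, 5) in an 8 × 8 image, the result is four 2 × 2 blocks, one in each quadrant of the image, because the window straddles the center and no single quadtree block of side 4 fits inside it. Each block is emitted in NW, NE, SW, SE order, which matches the increasing order of FD codes.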



4.4 Heuristics and Buffering Scheme 

In order to overcome the unnecessary and/or duplicate accesses of the previous 
algorithms, we propose a number of heuristics/rationales that aim 
at increasing I/O performance. 

— heuristic 1: Process small enough parts of the R-tree space so that the join 
processing of each part can be completed (in most cases) with the limited 
number of FD-codes that can be kept in main memory. In the presented 
form of the algorithm, each of these parts is a child of the root. 

— heuristic 2: Process the children of an R-tree node in order that is close to 
the order in which the FD-codes (quadtree sub-blocks) are transferred in 
main memory. This order, called FD-order, is formed by sorting MBRs by 
the FD-code that corresponds to their NW corner. 

— heuristic 3: While processing a part of the R-tree space, keep in memory only 
the FD-codes that may be needed at a later stage, drop all other FD-codes 
and fill up buffer with FD-codes that are needed but were not transferred in 
memory due to the buffer limit. 

— heuristic 4: Use a buffer scheme for both trees that reduces the need to 
transfer in memory multiple times the same disk pages (explained in detail 
below) . 
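
The FD-order of heuristic 2 can be illustrated with a short Python sketch. It computes the directional code of a pixel by interleaving the bits of its row and column (0 = NW, 1 = NE, 2 = SW, 3 = SE at each level) and sorts MBRs by the code of their NW corner. The assumptions are that MBRs are inclusive pixel rectangles (x_min, y_min, x_max, y_max) with y growing southward, so that (x_min, y_min) is the NW corner; the function names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # inclusive pixels, (x_min, y_min) = NW corner


def pixel_fd(col: int, row: int, n: int) -> str:
    """Directional FD code (n base-4 digits) of a pixel in a 2^n x 2^n image."""
    digits = []
    for bit in range(n - 1, -1, -1):          # from the root level downwards
        south = (row >> bit) & 1
        east = (col >> bit) & 1
        digits.append(str(2 * south + east))  # 0=NW, 1=NE, 2=SW, 3=SE
    return "".join(digits)


def fd_order(mbrs: List[Rect], n: int) -> List[Rect]:
    """Sort MBRs by the FD code of their NW corner."""
    # All codes have n digits, so string order equals numeric order.
    return sorted(mbrs, key=lambda m: pixel_fd(m[0], m[1], n))


ordered = fd_order([(4, 4, 5, 5), (0, 0, 1, 1), (4, 0, 5, 1)], 3)
```

In an 8 × 8 image, `pixel_fd(1, 1, 3)` gives "003" and `pixel_fd(5, 4, 3)` gives "301". The three sample MBRs are ordered by the codes 000, 100 and 300 of their NW corners.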

A buffering scheme that obeys Heuristic 4 is presented graphically in Figure 3. 
In detail, this scheme is as follows. 




Fig. 3. The buffering scheme 

— There is a path-buffer for R-tree node-pages (with number of pages equal 
to the height of the R-tree). However, the buffer pages of the R-tree buffer 
are larger than the actual R-tree disk pages, because for each entry (each 
MBR) an extra point is kept. This point is called START and expresses the 
pixel where processing of the relevant MBR has stopped (a special value, 
named MAX, specifies that processing of that MBR has been completed). 
This means that during transfers from disk to memory and the opposite an 
appropriate transformation of the page contents needs to be made. 
— There is a path-buffer for B+-tree node-pages (with number of pages equal 
to the height of the B+-tree). 

— It is assumed that the operating system keeps a large enough LRU-buffer 
for disk reads. The same assumption was made in [3]. This buffer is used for 
pages belonging in paths related to the current path that are likely to be 
accessed again in subsequent steps of the algorithm. 

— The last buffer, called FD-buffer, is not a page buffer but one that holds the 
FDs needed for the processing of the current R-tree part. Each entry in this 
buffer contains also a level mark (LM), that is a number that expresses the 
level of the current R-tree path at which and below which the related FD 
might be needed for join processing. The size of this buffer is important for 
the I/O efficiency of the sophisticated algorithms. 

Note that, the three simple algorithms described above can be easily made 
more effective by using a path-buffer and an LRU-buffer for each tree. As will be 
demonstrated in the experimentation section, by using adequately large LRU- 
buffers, the performance of the simple algorithms is comparable to that of the 
sophisticated ones. 

The searching method used in the algorithms is as follows. When, for a point 
belonging in an R-tree MBR, the existence of a Linear Quadtree code that covers 
this point (its block contains this point) needs to be determined, we search the 
B+-tree for the maximum FD-code M that is smaller than or equal to the FD- 
code P of the pixel related to this point. If M = P and depth(M) = depth(P), 
then this specific black pixel exists in the Quadtree. If M ≤ P, depth(M) < 
depth(P) and the directional codes of M and P coincide in the first depth(M) 
digits, then M is an ancestor of P (it represents a block that contains P). This 
searching method is used in lines 32 and 47 of the “One level FD-buffer join” and 
“Many levels FD-buffer join” algorithms, respectively. In the following, these two 
algorithms, which are designed according to the above heuristics, are presented. 
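
The searching method can be sketched in Python with a sorted list standing in for the leaf level of the B+-tree. This is an illustration under assumptions: each black block is a (directional code, depth) tuple, the pixel is given by its n-digit directional code, and `bisect_right` from the standard library plays the role of the B+-tree descent. The name `covering_block` is hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

FD = Tuple[str, int]  # (n directional digits, depth), an assumed layout


def covering_block(fd_codes: List[FD], pixel_code: str) -> Optional[FD]:
    """Return the black block that contains the pixel, or None.

    `fd_codes` must be sorted; this stands in for the B+-tree leaf sequence.
    """
    n = len(pixel_code)
    # Largest entry M whose directional code is <= the pixel code P.
    # Using depth n in the probe keeps an equal code at or before the probe.
    i = bisect_right(fd_codes, (pixel_code, n)) - 1
    if i < 0:
        return None
    code, depth = fd_codes[i]
    # M contains P when both agree on the first depth(M) directional digits;
    # depth(M) == n is the case of the pixel itself being a black leaf.
    if code[:depth] == pixel_code[:depth]:
        return fd_codes[i]
    return None


# Two black blocks in an 8 x 8 image: a 2 x 2 block and a single pixel.
quadtree = [("000", 2), ("031", 3)]
```

With these two blocks, pixel 003 is covered by the 2 × 2 block ("000", 2), pixel 031 is itself black, and pixels 012 and 032 are not covered. A single prefix comparison handles both cases described above, since blocks of one Quadtree never nest.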

4.5 One Level FD-Buffer Join 

In very abstract terms, this algorithm works as follows: 

— Process the children of the R-tree root in FD-order. 

— Read as many FDs as possible for the current child and store them in FD- 
buffer. 

— Call recursively the Join routine for this child. 

— When the Join routine returns, empty the FD-buffer and repeat the previous 
two steps until the current child has been completely checked. 

— Repeat for the next child of the root. 

The Join routine for a node works as follows: 

— If the node is a leaf, check intersections and return. 

— If not (this is a non- leaf node), for each child of the node that has not 
been examined in relation to the FDs in FD-buffer, call the Join routine 
recursively. 
In pseudo-code form this algorithm is as follows: 

01 insert R-tree root in path-buffer;
02 for every MBR x in R-tree root
03     START(x) := NW-corner-of(x);
04 order MBRs in R-tree root according to FD of their START;
05 for every MBR x in R-tree root, in FD-order
06     while START(x) < MAX begin
07         read-FDs-in-buffer(x);
08         R-Quad-Join(node of x);
09         remove every FD from FD-buffer;
10     end;

11 Procedure R-Quad-Join(Z: R-tree node);
12 begin
13     if Z is not in path-buffer
14         insert Z in path-buffer;
15     if Z is internal then begin
16         for every MBR x in Z
17             START(x) := NW-corner-of(x);
18         order MBRs in Z according to FD of their START;
19         for every MBR x in Z, in FD-order
20             if START(x) < START(MBR of Z) begin
21                 START(x) := first pixel of x after the last FD accessed,
                       or MAX (if no such pixel exists);
22                 if START(x) ≠ MAX or at least one FD in FD-buffer intersects x
23                     R-Quad-Join(node of x);
24             end;
25     end
26     else
27         check and report possible intersection of MBR of Z and
               every FD in FD-buffer;
28 end;

29 Procedure read-FDs-in-buffer(Z: MBR);
30 begin
31     while START(Z) < MAX and FD-buffer not full begin
32         search in QuadTree for FD f covering START(Z) or
               for the next FD (in FD-order);
33         if no FD was accessed
34             f := MAX;
35         if f intersects Z
36             store f in FD-buffer;
37         START(Z) := first pixel of Z after f, or MAX (if no such pixel exists);
38     end;
39 end;



In Figure 4, a simple example demonstrating the simultaneous subdivision of 
space by a Quadtree for an 8x8 image (thin lines) and part of an R-tree (thick 
dashed lines) is depicted. The Quadtree contains two black quadrangles (the 
two dotted areas), Q1 and Q2. The MBR of the root of the R-tree is rectangle 
A. However, in Figure 4 only one of the children of this root is depicted, the 
one with MBR B. Moreover, a child of B is depicted, the one with MBR C. 
We suppose that C is a leaf of the R-tree and it has no intersection with the 
other children of B (not shown in the figure). Consider a trivial situation, where 
the FD-buffer of the algorithm presented above can host one FD-code and the 
corresponding level mark (LM), only. Let’s trace the most important steps of a 
partial run of this algorithm for this specific case. 

Fig. 4. An example showing Quadtree (thin lines) and R-tree (thick dashed 
lines) subdivision of space. 

R-tree data: 
    NW-corner-of(B) = 003, level(B) = 1 
    SE-corner-of(B) = 301 
    NW-corner-of(C) = 003, level(C) = 0 
    SE-corner-of(C) = 031 

Quadtree data: 
    Q1 = 000, level = 1 
    Q2 = 031, level = 0 



 1  START(B) = 003;                      (l.3)
 2  read-FDs-in-buffer(B);               (l.7)
 3  search reaches Q1                    (l.32)
 4  FD-buffer: f = 000                   (l.36)
 5  START(B) = 012;                      (l.37)
 6  R-Quad-Join(B);                      (l.8)
 7  Since B is internal...               (l.15)
 8  START(C) = 003;                      (l.17)
 9  C is the first MBR in B              (l.18)
10  Since START(C) < START(B)...         (l.20)
11  START(C) = 012;                      (l.21)
12  R-Quad-Join(C);                      (l.23)
13  Since C is a leaf...                 (l.15)
14  report intersection of C and Q1;     (l.27)
15  R-Quad-Join(C) returns;              (l.28)
16  Other children of B are processed    (l.19)
17  R-Quad-Join(B) returns;              (l.28)
18  remove 000 from FD-buffer;           (l.9)
19  read-FDs-in-buffer(B);               (l.7)
20  search reaches Q2                    (l.32)
21  FD-buffer: f = 031                   (l.36)
22  START(B) := 032;                     (l.37)
23  R-Quad-Join(B);                      (l.8)
24  Since B is internal...               (l.15)
25  START(C) = 003;                      (l.17)
26  C is the first MBR in B              (l.18)
27  Since START(C) < START(B)...         (l.20)
28  START(C) = MAX;                      (l.21)
29  R-Quad-Join(C);                      (l.23)
30  Since C is a leaf...                 (l.15)
31  report intersection of C and Q2;     (l.27)
32  R-Quad-Join(C) returns;              (l.28)
33  Other children of B are processed    (l.19)
34  R-Quad-Join(B) returns;              (l.28)
35  remove 031 from FD-buffer;           (l.9)



The run of the algorithm continues with the next loop for B. The interested 
reader can trace the same example with FD-buffer size equal to 2 and note 
the differences. This algorithm only fetches and releases FDs at the level of the 
children of the root. In this case, the LM of each FD need not be kept in the 
FD-buffer. 

4.6 Many Levels FD-Buffer Join 

This algorithm follows the same basic steps. However, it releases from the 
FD-buffer the FDs that will no longer be needed in the current phase of the 
algorithm as soon as possible and then refills the buffer. Again, in very abstract 
terms, this algorithm works as follows: 

— Process the children of the R-tree root in FD-order. 

— Read as many FDs as possible for the current child and store them in FD- 
buffer. 

— Call recursively the Join routine for this child. 

— When the Join routine returns, repeat the previous two steps until the cur- 
rent child has been completely checked. 

— Repeat for the next child of the root. 

The Join routine for a node works as follows: 

— If the node is a leaf, remove from FD-buffer all FDs that will not be needed 
in the current phase of the algorithm, check intersections and return. 

— If not (this is a non- leaf node), for each child of the node that has not been 
examined in relation to the FDs in FD-buffer, mark the FDs that only affect 
the results for this child and call the Join routine recursively. 

— When all the children of the node have been examined, reorder them and 
repeat the previous step until all the children have been examined with the 
current state of the FD-buffer. 

— Remove from FD-buffer all FDs that will not be needed in the current phase 
of the algorithm and return. 
Due to the possibility to remove FDs from FD-buffer at any level, an extra 
variable, called NEXT, that keeps track of the pixel where fetching of FDs has 
stopped is needed in this algorithm. In pseudo-code form the algorithm is as 
follows: 

01 insert R-tree root in path-buffer;
02 for every MBR x in R-tree root
03     START(x) := NW-corner-of(x);
04 order MBRs in R-tree root according to FD of their START;
05 for every MBR x in R-tree root, in FD-order begin
06     NEXT := NW-corner-of(x);
07     while START(x) < MAX begin
08         read-FDs-in-buffer(node of x);
09         R-Quad-Join(node of x);
10     end;
11 end;

12 Procedure R-Quad-Join(Z: R-tree node);
13 begin
14     if Z is not in path-buffer
15         insert Z in path-buffer;
16     if Z is internal node then begin
17         for every MBR x in Z
18             START(x) := first pixel of x > START(MBR of Z),
                   or MAX (if no such pixel exists);
19         order MBRs in Z according to FD of their START;
20         repeat-flag := True;
21         while repeat-flag begin;
22             repeat-flag := False;
23             for every MBR x in Z, in FD-order
24                 if START(x) < SE-corner-of(last FD accessed) and
                       START(x) ≠ MAX begin
25                     repeat-flag := True;
26                     if FD-buffer not empty
27                         for every f in FD-buffer with LM(f) = level-of(Z)
                               such that x is intersected by f and
                               SE-corner-of(f) < START(y), ∀ y ≠ x intersected by f
28                             LM(f) := level-of(node of x);
29                     if at least one FD in FD-buffer intersects x begin
30                         R-Quad-Join(node of x);
31                         read-FDs-in-buffer(Z);
32                     end
33                     else
34                         START(x) := first pixel of x after the last FD accessed, or
                               MAX (if no such pixel exists);
35                 end;
36             order MBRs in Z according to FD of their START;
37         end;
38     end
39     else
40         check and report possible intersection of MBR in Z and every
               FD f in FD-buffer, such that SE-corner-of(f) > START(MBR of Z);
41     START(MBR of Z) := first pixel of MBR of Z after the last FD accessed, or
           MAX (if no such pixel exists);
42     remove from FD-buffer every FD f with LM(f) = level-of(Z);
43 end;

44 Procedure read-FDs-in-buffer(Z: R-tree node);
45 begin
46     while NEXT < MAX and FD-buffer not full begin
47         search in QuadTree for FD f covering NEXT
               or for the next FD (in FD-order);
48         if no FD was accessed
49             f := MAX;
50         if f intersects the active MBR in the top path-buffer node begin
51             LM(f) := level-of(top path-buffer node) - 1;
52             while LM(f) > level-of(Z) and f intersects only the active MBR
                   in the next lower path-buffer node
53                 LM(f) := LM(f) - 1;
54             store f in FD-buffer
55         end;
56         NEXT := first pixel of the active MBR in the top path-buffer node after f,
               or MAX (if no such pixel exists);
57     end;
58 end;



5 Experimentation 

We performed experiments with Sequoia data from the area of California. The 
point data correspond to specific country sites, while regional data correspond 
to three different categories: visible, emitted infrared and reflected infrared spec- 
trums. We performed experiments for all the combinations of point and regional 
data. The query used for all experiments is the intersection join query between 
these two kinds of data: “which country sites are covered with colored regions?” . 

Many parameters varied in these experiments. Each pixel 
of the regional data had a range of 256 values. Each image was converted to 
black and white by choosing a threshold so as to achieve a requested 
proportion of black to white. The proportion of black ranged between 20% and 
80%. The images in our experiments were 1024x1024 and 2048x2048 pixels in 
size. The cardinality of the point set was 68764. The page size for both trees 
was 1024 bytes. Under these conditions, both R and B+ trees had 4 levels 
(including the leaf level). The FD-buffer size for the two sophisticated 
algorithms ranged between 150 and 2500 FDs. The LRU-buffer size for each tree 
ranged between 0 and 40 pages. 

Fig. 5. The performance of the five algorithms as a function of LRU-buffer size
for 50% (upper diagram) and 80% (lower diagram) black images.

In the following, some characteristic results of the large number of experiments
performed for images of 2048 x 2048 pixels from the visible spectrum are
depicted. Note that, since n = 11, an FD for such an image requires
2 x 11 + ⌈log2(11 + 1)⌉ = 26 bits. In the upper (lower) part of Figure 5 the
number of disk accesses during join for each of the five algorithms, when images
are 50% (80%) black, is shown as a function of LRU-buffer size. More
specifically, “one-500 FDs” stands for “One level FD-buffer join” with FD-buffer
holding up to 500 FDs, “many-500 FDs” for “Many levels FD-buffer join” with
FD-buffer holding up to 500 FDs, “R-B seq” for “R to B+ Join with sequential FD
access”, “R-B max” for “R to B+ Join with maximal block decomposition” and “B-R”
for “B+ to R Join”. It is evident that R to B+ Join with sequential FD access has
the worst performance, which does not improve as the LRU-buffer size increases.
In contrast, R to B+ Join with maximal block decomposition and B+ to R
Join improve and achieve performance comparable to the sophisticated
algorithms as the LRU-buffer size increases. The two sophisticated algorithms
perform well, even for small LRU-buffers.

To study the situation more closely, in Figure 6 we present performance
results for the two simple algorithms that have better performance and two
versions (for FD-buffer size equal to 500 and to 1500 FDs) of the sophisticated
algorithms. The diagram on the upper (lower) part corresponds to 50% (80%)
black images. We can easily see that the sophisticated algorithms perform very
well for a wide range of LRU- and FD-buffer sizes, while the two simple
algorithms achieve comparable, or even better, performance than the sophisticated
ones when the LRU-buffer size increases.
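The FD size quoted above for 2048 x 2048 images can be reproduced with a few lines of Python. This is a minimal sketch: the general form 2n + ⌈log2(n + 1)⌉ for a 2^n x 2^n image is an assumption generalized from the n = 11 case given in the text, and the function name fd_bits is hypothetical.

```python
def fd_bits(n: int) -> int:
    """Bits per FD for a 2^n x 2^n image: 2n + ceil(log2(n + 1)).

    Assumption: the formula generalizes the n = 11 case stated in the text.
    """
    # n.bit_length() equals ceil(log2(n + 1)) for every n >= 0
    return 2 * n + n.bit_length()


print(fd_bits(11))  # 2048 x 2048 image -> 26
print(fd_bits(10))  # 1024 x 1024 image -> 24
```

Using int.bit_length keeps the computation in exact integer arithmetic and avoids rounding issues of a floating-point logarithm.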





Fig. 6. The performance of two of the simple and two versions of the sophisti- 
cated algorithms as a function of LRU-buffer size for 50% (upper diagram) and 
80% (lower diagram) black images. 



Finally, in Figure 7 the performance of the sophisticated algorithms for various
kinds of images is depicted as a function of a combination of LRU-buffer size
and FD-buffer size. Many levels FD-buffer join performs slightly better than
One level FD-buffer join in all cases. Note that the difference gets smaller when
the FD-buffer size increases.





[Figure 7 plot; horizontal axis: LRU size - FD buffer size]



Fig. 7. The performance of the sophisticated algorithms as a function of LRU- 
and FD-buffers sizes for 20%, 50% and 80% black images. 



6 Conclusions 

In this report, we presented five algorithms for processing of joins between two 
popular but different structures used in spatial databases and Geographic In- 
formation Systems, R-trees and Linear Region Quadtrees. These are the first 
algorithms in the literature that process joins between these two structures. 
Three of the algorithms are simple, but suffer from unnecessary and/or repeated 
disk accesses, when the amount of memory supplied for buffering is limited. The
other two are more sophisticated and are based on heuristics that aim at
minimizing the I/O cost of join processing. That is, they try to minimize the transfer
to main memory of irrelevant data or the multiple transfer of the same data. 
Moreover, we presented results of experiments performed with real data. These 
experiments investigate the I/O performance of the different join algorithms. 
The presented results show that 

- Better performance is achieved when using the sophisticated algorithms in a
system with a small LRU-buffer. For example, in a system with many users,
where buffer space is shared by many processes, the sophisticated algorithms
are the best choice. Besides, the sophisticated algorithms are quite stable.
That is, their performance does not depend heavily on the amount of available
main memory.

- When there is enough main memory (when the LRU-buffer is big enough),
one of the simple algorithms, the B+ to R Join algorithm, performs very
well.

Intuition leads us to believe that the sophisticated algorithms are expected to 
perform better for data that obey unusual distributions, since they are designed 
to partially adapt to the data distribution. 

The presented algorithms perform only the filtering step of the join.
Processing of the refinement step requires the choice and use of Computational
Geometry algorithms, which is among our future plans in this area.
Each choice should take into account not only worst-case complexity, but
also the expected sizes of the data sets, the average-case complexity and the
multiplicative constant of the complexity, for each alternative. In addition,
these algorithms could be tested on other kinds of data, e.g. making use of
R-trees as a storage medium for region data as well. Moreover, we plan to
elaborate the presented heuristics even further and/or examine different page
replacement policies (other than LRU) so as to improve performance, or master
the worst-case behavior of the join algorithms (when they deal with
“pathological” data, deviating from the usual situations in practice). Another
route of research would be to examine the parallel processing of this kind of
join, based on ideas presented in

Abstract. We consider the problem of performing polygonal map overlay
and the refinement step of spatial overlay joins. We show how to
adapt algorithms from computational geometry to solve these problems 
for massive data sets. A performance study with artificial and real-world 
data sets helps to identify the algorithm that should be used for given 
input data. 



1 Introduction 

During the last couple of years Spatial- and Geo-Information Systems (GIS) 
have been used in various application areas, like environmental monitoring and 
planning, rural and urban planning, and ecological research. Users of such sys- 
tems frequently need to combine two sets of spatial objects based on a spatial 
relationship. 

In two-dimensional vector based systems combining two maps m1 and m2
consisting of polygonal objects by map overlay or spatial overlay join is an
important operation. The spatial overlay join of m1 and m2 produces a set of
pairs of objects (o1, o2) where o1 ∈ m1, o2 ∈ m2, and o1 and o2 intersect. In
contrast, the map overlay produces a set of polygonal objects consisting of the
following objects:

- All objects of m1 intersecting no object of m2

- All objects of m2 intersecting no object of m1

- All polygonal objects produced by two intersecting objects of m1 and m2

The map overlay operation can be considered a special outer join. For simplicity
of presentation, we assume that all objects of map m1 are marked red and
that all objects of map m2 are marked blue. Usually spatial join operations are
performed in two steps:

- In the filter step an approximation of each spatial object (usually a
conservative one, e.g., the minimum bounding rectangle) is used to eliminate
objects that cannot be part of the result.

- In the refinement step each pair of objects passing the filter step is examined
according to the spatial join condition.
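The two steps can be illustrated with a small Python sketch. It is not the method of this paper: the filter is a plain nested loop over minimum bounding rectangles, and the exact test (convex_intersect, a separating-axis test) assumes convex polygons with integer coordinates purely to keep the example short. All function names are hypothetical.

```python
from itertools import product


def mbr(polygon):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a vertex list."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def mbrs_intersect(a, b):
    """True if two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def convex_intersect(p, q):
    """Exact test for convex polygons (assumption) via separating axes."""
    for poly in (p, q):
        for k in range(len(poly)):
            (x1, y1), (x2, y2) = poly[k], poly[(k + 1) % len(poly)]
            nx, ny = y2 - y1, x1 - x2  # normal of edge k
            proj_p = [nx * x + ny * y for x, y in p]
            proj_q = [nx * x + ny * y for x, y in q]
            if max(proj_p) < min(proj_q) or max(proj_q) < min(proj_p):
                return False  # this edge normal separates p and q
    return True


def overlay_join(red, blue, exact_intersects):
    """Return index pairs (i, j) of intersecting red/blue objects."""
    red_boxes = [mbr(p) for p in red]
    blue_boxes = [mbr(p) for p in blue]
    # Filter step: conservative test on the approximations only.
    candidates = [(i, j)
                  for i, j in product(range(len(red)), range(len(blue)))
                  if mbrs_intersect(red_boxes[i], blue_boxes[j])]
    # Refinement step: exact geometry test on the surviving pairs.
    return [(i, j) for i, j in candidates if exact_intersects(red[i], blue[j])]


red = [[(0, 0), (4, 0), (0, 4)]]                  # a triangle
blue = [[(3, 3), (5, 3), (5, 5), (3, 5)],         # MBRs overlap, shapes do not
        [(1, 1), (2, 1), (2, 2), (1, 2)],         # lies inside the triangle
        [(10, 10), (11, 10), (11, 11), (10, 11)]]  # rejected by the filter
print(overlay_join(red, blue, convex_intersect))  # [(0, 1)]
```

The first blue square shows why the refinement step is needed: its bounding rectangle overlaps that of the triangle although the two objects are disjoint.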



R.H. Güting, D. Papadias, F. Lochovsky (Eds.): SSD’99, LNCS 1651, pp. 270-285, 1999.
(c) Springer-Verlag Berlin Heidelberg 1999

Although there exists a variety of algorithms for realizing the filter step of
the join operator for massive data sets, not much research has been
done in realizing the refinement step of the map overlay operation for large sets of
polygonal objects, with the exception of the work by Kriegel et al. The methods
presented by Brinkhoff et al. and by Huang et al. speed up the refinement step
of the spatial overlay join by either introducing an additional filter step which
is based on a progressive approximation or exploiting symbolic intersection
detection. Both approaches allow the early identification of some pairs of
intersecting objects without investigating the exact shapes. However, they do
not solve the map overlay problem.

In this paper we discuss how well known main-memory algorithms from
computational geometry can help to perform map overlay and spatial overlap
join in a GIS. More specifically, we show how the algorithm by Nievergelt
and Preparata and a new modification of Chan’s line segment intersection
algorithm can be extended to cope with massive real-world data sets.
Furthermore, we present an experimental comparison of these two algorithms
showing that the algorithm by Nievergelt and Preparata performs better than
the modified algorithm by Chan, if the number of intersection points is
sublinear, although Chan’s algorithm outperforms other line segment intersection
algorithms. In this paper we do not consider algorithms using a network
oriented representation, since many GIS support only polygon
oriented representations, e.g., ARC/INFO.

The remainder of this paper is structured as follows. In Section 2 we review
the standard algorithm by Nievergelt and Preparata for overlaying two sets of
polygonal objects and show how to modify Chan’s algorithm for line intersection
in order to perform map overlay. In Section 3 we discuss extensions required to
use these algorithms for practical massive data sets. The results of our
experimental comparison are presented in Section 4.



2 Overlaying Two Sets of Polygonal Objects 

Since the K intersection points between line segments from different input maps
(N line segments in total) are part of the output map, line segment intersection
has been identified as one of the central tasks of the overlay process. Methods
for finding these intersections can be categorized into two classes: algorithms
which rely on a partitioning of the underlying space, and algorithms exploiting
a spatial order defined on the segments. While representatives of the first group,
e.g., the method of adaptive grids, tend to perform very well in practical
situations, they cannot guarantee efficient treatment of all possible configurations.
The worst-case behavior of these algorithms may match the complexity of a
brute-force approach, whereas the behavior for practical data sets often depends
on the distribution of the input data in the underlying space rather than on the
parameters N and K. Since we will examine the practical behavior of overlay
algorithms under varying parameters N and K (Section 4), we restrict our study
to algorithms that do not rely on a partitioning of the underlying space.

Several efficient algorithms for the line segment intersection problem have
been proposed; most of these are based on the plane-sweep paradigm,
a framework facilitating the establishment of a spatial order.

The characterizing feature of the plane-sweep technique is an (imaginary)
line that is swept over the entire data set. For the sake of simplicity, this
sweep-line is usually assumed to be perpendicular to the x-axis of the coordinate
system and to move from left to right. Any object intersected by the sweep-line
at x = t is called active at time t, and only active objects are involved in
geometric computations at that time. These active objects are usually stored in
a dictionary called y-table. The status of the y-table is updated as soon as the
sweep-line moves to a point where the topology of the active objects changes
discontinuously: for example, an object must be inserted in the y-table as soon
as the sweep-line hits its leftmost point, and it must be removed after the
sweep-line has passed its rightmost point. For a finite set of objects there are
only finitely many points where the topology of the active objects changes
discontinuously; these points are called events and are stored in increasing
order of their x-coordinates as an ordered sequence, e.g., in a priority queue.
This data structure is also called x-queue. Depending on the problem to be
solved there may exist additional event types.
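The interplay of x-queue and y-table can be sketched in a few lines of Python. This is only an illustration of the event mechanism under simplifying assumptions: the y-table is an unordered set rather than an ordered dictionary, only insert and delete events are generated, and the function name max_active is hypothetical. The sketch reports the largest number of simultaneously active segments.

```python
def max_active(segments):
    """Sweep left to right and return the largest number of active segments."""
    events = []  # the x-queue: (x, kind, segment id)
    for sid, ((x1, _y1), (x2, _y2)) in enumerate(segments):
        left, right = min(x1, x2), max(x1, x2)
        events.append((left, 0, sid))   # kind 0: insert at the leftmost point
        events.append((right, 1, sid))  # kind 1: delete at the rightmost point
    # Sorting by (x, kind) handles inserts before deletes at equal x, so
    # segments that merely touch a common sweep position count as active.
    events.sort()
    active = set()  # stands in for the y-table (no vertical order kept here)
    best = 0
    for _x, kind, sid in events:
        if kind == 0:
            active.add(sid)
            best = max(best, len(active))
        else:
            active.remove(sid)
    return best


segments = [((0, 0), (2, 0)), ((1, 1), (3, 1)), ((4, 0), (5, 0))]
print(max_active(segments))  # 2
```

A real implementation would replace the set by a balanced search tree ordered by the aboveness relation, as described in the following paragraphs.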

In the following two sections we recall the plane-sweep algorithm by Bentley
and Ottmann and its extension by Nievergelt and Preparata to compute
the overlay of polygonal objects. These algorithms exploit an order established
during the plane-sweep: for a set of line segments we can define a total order 
on all segments active at a given time by means of the aboveness relation. An 
active segment a lying above an active segment b is considered to be “greater” 
than b. This order allows for efficient storage of all active segments in an ordered 
dictionary (y-table), e.g., in a balanced binary tree which in turn provides for 
fast access to the segments organized by it. 

2.1 The Line Segment Intersection Algorithm by Bentley and 
Ottmann 

The key observation for finding the intersections is that two segments intersecting
at x = t0 must be direct neighbors at x = t0 - ε for some ε > 0. To find all
intersections it is therefore sufficient to examine all pairs of segments that become
direct neighbors during the sweep. The neighborhood relation changes only at
event points and only for those segments directly above or below the segments
inserted, deleted, or exchanged at that point. Thus, we have to check two pairs of
segments at each insert event and one pair of segments at each delete event. If we
detect an intersection between two segments we do not report it immediately but
insert a new intersection event in the x-queue. If we encounter an intersection
event we report the two intersecting segments and exchange them in the y-table
to reflect the change in topology. The neighbors of these segments change, and
two new neighboring pairs need to be examined. Since there is one insert and
one delete event for each of the N input segments and one event for each of the
K intersection points the number of events stored in the x-queue is bounded
by O(N + K). At most N segments can be present in the y-table at any time.
If we realize both the x-queue and the y-table so that they allow updates in
logarithmic time, the overall running time of the algorithm can be bounded
by O((N + K) log N). Brown showed how to reduce the space requirement
to O(N) by storing at most one intersection event per active segment. Pach and
Sharir achieve the same reduction by storing only intersection points of pairs
of segments that are currently adjacent on the sweep-line.
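Each time two segments become direct neighbors, the sweep has to decide whether they intersect. A common way to do this is the orientation test sketched below; it is a generic textbook primitive, not code from this paper, and it assumes integer coordinates so that all arithmetic is exact. The function names are hypothetical.

```python
def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left, -1 right, 0 collinear."""
    d = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (d > 0) - (d < 0)


def in_box(p, q, r):
    """True if r, known to be collinear with pq, lies within the box of pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """True if the closed segments ab and cd share at least one point."""
    o1, o2 = orient(a, b, c), orient(a, b, d)
    o3, o4 = orient(c, d, a), orient(c, d, b)
    if o1 != o2 and o3 != o4:
        return True  # each segment has its endpoints on different sides
    # Remaining cases: an endpoint is collinear with the other segment.
    return ((o1 == 0 and in_box(a, b, c)) or (o2 == 0 and in_box(a, b, d))
            or (o3 == 0 and in_box(c, d, a)) or (o4 == 0 and in_box(c, d, b)))


print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))  # True
print(segments_intersect((0, 0), (1, 1), (2, 2), (3, 3)))  # False
```

With floating-point input the sign computation in orient would need a robust predicate; the integer assumption sidesteps that issue here.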



2.2 The Region-Finding Algorithm by Nievergelt and Preparata 

Two neighboring segments in the y-table can also be considered as bounding
edges of the region between them. This viewpoint leads to an algorithm for
computing the regions formed by a self-intersecting polygon as proposed by
Nievergelt and Preparata. Each active segment s maintains information
about the regions directly above and directly below in two lists A(s) and B(s).
These lists store the vertices of the regions swept so far and are updated at
the event points. Nievergelt and Preparata define four event types as illustrated
in Figure 1.




Fig. 1. Four event types in Nievergelt and Preparata’s algorithm.



At each event point p the segments s and t are treated as in the line segment
intersection algorithm described in Section 2.1. In addition, new regions are
created at start and intersection events, while existing regions are closed or merged
at intersection and end events. Bend events cause no topological modifications
but only the replacement of s by t and the augmentation of the involved vertex
lists by p. With a careful implementation the region lists can be maintained
without asymptotic overhead so that the algorithm detects all regions induced
by N line segments and K intersection points in O((N + K) log N) time and
O(N + K) space.

Although there are only O(N) active segments at any given time, the
combinatorial complexity of all active regions, i.e., the total number of points on their
boundaries, is slightly higher.

Lemma 1. The combinatorial complexity of active regions at any given time
during the region-finding algorithm is bounded by O(N α(N)), where α(N) is the
inverse Ackermann function.

Proof. Consider the sweep-line at time t, cut all segments crossing the
sweep-line at x = t, and discard all segments and fragments to the right of the sweep-line.
This results in an arrangement of at most N segments. All active regions now
belong to one single region of this arrangement, namely to the outer region.
Applying a result by Sharir and Agarwal [Remark 5.6] we immediately obtain
an upper bound of O(N α(N)) for the combinatorial complexity of the active
regions to the left of the sweep-line. The same argument obviously holds for the
combinatorial complexity of the active regions to the right of the sweep-line.

Taking into account Lemma 1 we can bound the space requirement for
the region-finding algorithm by O(N α(N)). To do so we modify the
Bentley-Ottmann algorithm according to Brown’s comments and report all regions
as soon as they have been swept completely.

To use this algorithm for overlaying a set of red polygonal objects and a set of 
blue polygonal objects we store the endpoints of the corresponding blue and red 
segments in the x-queue in sorted order. During the sweep we can determine for 
each computed region whether it is an original region, i.e., a red or blue region, 
or resulting from the intersection of a red and a blue region. In the latter case, we 
can combine the attributes of two overlapping regions by an application-specific 
function to obtain the attribute of each new region if we store the attribute of 
each region for each of its bounding segments. 



2.3 The Red/Blue Line Segment Intersection Algorithm by Chan 

The algorithm described in the previous section can be used to determine the
intersection points induced by any set of line segments. In many GIS applications,
however, we know that the line segments in question belong to two distinct sets,
each of which is intersection-free. Chan proposed an algorithm that computes
all K intersection points between a red intersection-free set of line segments and
a blue intersection-free set of line segments with a total number of N segments
in optimal O(N log N + K) time and O(N) space. Since this algorithm forms the
basic building block of an overlay algorithm which we present in the following
sections, we explain it in more detail than the previous algorithms.

The asymptotic improvements are achieved by computing intersection points 
by looking backwards at each event instead of computing intersection points in 
advance and storing them in the x-queue. Since intersection events are used to 
determine intersection points and to update the status of the sweep-line (via 
exchanging the intersecting segments) it is not possible to maintain both red 
and blue segments in a single y-table without storing intersection events. Chan 
solves this conflict by introducing two y-tables, one for each set of line segments. 
Since segments within each set do not intersect both y-tables can be maintained 
without exchanging segments. The events, i.e., the endpoints of the segments,
are maintained in one x-queue, and depending on the kind and color of the event
point the segment starting or ending at that point is inserted in or deleted from
the corresponding y-table. For the analysis of the time complexity we assume
that the table operations insert, delete, and find can be handled in O(log N) time,
and that the successor and predecessor of a given table item can be determined
in constant time.

The algorithm can be explained best by looking at the trapezoidal decompo- 
sition of the plane induced by the blue segments and the endpoints of the blue 
and the red segments. Each endpoint produces vertical extensions ending at the 
blue segments directly above and below this point. These extensions are only 
imaginary and are neither stored nor involved in any geometric computation 
during the algorithm. Figure 2(a) shows the trapezoidal decomposition induced
by a set of blue segments (solid lines) and the endpoints of both these and the 
red segments (dashed lines). 




(a) Trapezoidal decomposition. 




Fig. 2. Trapezoid sweep as proposed by Chan.



At each event p the algorithm computes all intersections between the active
red segments and the boundaries of the active blue trapezoids that lie to the left
of or on the sweep-line and have not been reported yet. The trapezoids whose right
boundaries correspond to the event p are then added to the imaginary
sweep-front, the collection of trapezoids that have been swept completely so far. An
example is depicted in Figure 2(b): all active trapezoids are shaded light grey,
and the sweep-front is shaded dark grey. After having processed the event p the
trapezoids T1 and T2 will be added to the sweep-front.

The blue boundary segments of the trapezoid(s) ending at p are processed in 
ascending order: starting from the red segment directly above the current blue 
segment we check the active red segments in ascending order for intersections to 
the left of the sweep-line with the current blue segment and its predecessors in 
the blue y-table. The algorithm locates each red segment in the blue y-table and 
traces it to its left endpoint while reporting all relevant intersections with active
blue segments. Note that active red segments and active blue segments might
intersect outside the active trapezoids, i.e., on the boundary of trapezoids already
in the sweep-front (for example, the intersection points [4] and [6] in Figure 2(b)).

To avoid multiple computation of such points, each active red segment stores
its intersection point with the largest x-coordinate (if any). The algorithm
stops finding intersections for a given red segment as soon as this most recent
intersection point is detected again, or if no intersection between a red segment
and the current blue segment has been found. In the first situation the algorithm
starts examining the next active red segment, whereas in the latter situation we
know that no red segment above the current segment can have an undetected
intersection with this blue segment. Therefore the red segment directly below the
current blue segment and its predecessors in the red y-table are tested for
intersection with the current blue segment. When these tests are finished we
process the next blue boundary segment of a trapezoid ending at p. In Figure 2(b)
the intersection points are labeled according to the order in which they are
detected: black squares indicate new intersection points, and white squares
indicate intersection points found during the processing of earlier events.

Since each of the N segments requires one start event and one end event and
is stored in exactly one y-table, the space requirement for this algorithm is O(N).
Initializing the x-queue and maintaining both y-tables under N insertions and
N deletions can be done in O(N log N) time. To find the red segment directly
above each event point and to locate this segment in the blue y-table we need
additional O(log N) time per event. Tracing a red segment through the blue
table takes O(1) time per step. Except for the last step per segment each trace
step corresponds to a reported intersection, and there are at most two segments
examined per event that do not contribute to the intersections reported at that
event. Thus, the number of trace steps performed at one event point is of the same
order as the number of intersections reported at that event point, resulting in a
total number of O(K) trace steps during the algorithm. The time complexity of
this line segment intersection algorithm is O(N log N + K), which is optimal.
For a complete description we refer the reader to the original paper.
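To get a feeling for the two bounds, the expressions (N + K) log N and N log N + K can be evaluated side by side. The sketch below is only arithmetic on the bound expressions with all constant factors set to 1 (an assumption); it does not model actual running times, and the function names are hypothetical.

```python
import math


def cost_sweep(n, k):
    """(N + K) log2 N, the bound expression for the sweep of Section 2.1."""
    return (n + k) * math.log2(n)


def cost_red_blue(n, k):
    """N log2 N + K, the bound expression for the red/blue algorithm."""
    return n * math.log2(n) + k


n = 1_000_000
for k in (1_000, n, 100 * n):
    ratio = cost_sweep(n, k) / cost_red_blue(n, k)
    print(f"K = {k:>11,}: ratio of bound expressions = {ratio:.2f}")
```

For K much smaller than N the two expressions nearly coincide, so under this simplification constant factors decide which algorithm is faster in practice; the gap only opens up as K grows beyond N.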

2.4 Region-Finding Based upon Red/Blue Line Segment 
Intersection 

The key to an efficient region-finding algorithm based upon Chan’s algorithm is
to use the lists A(s) and B(s) introduced by Nievergelt and Preparata. Recall
that these lists store the vertices of the regions above and below each active
segment s and are updated at the event points. Updating these lists at endpoints
of segments or intersection points is done analogously to the algorithm of
Nievergelt and Preparata. Since there are no explicit intersection events in Chan’s
algorithm, we have to ensure that these updates are done in the proper order.
Consider the situation in Figure 3(a): if we try to process the intersection points
in the standard order we are not able to describe the region ending at
intersection point 1 since we do not know the intersection points 2 and 3 also describing
the region (intersection point [4] has been detected at a previous event). If we
had detected the points in the order shown in Figure 3(b) we would have found
the intersection points 4 and 5 prior to intersection point 6. To do so, we only
need to report the intersection points in reverse order. It is not hard to show
that intersection points are then reported in lexicographical order along each
blue and red segment, and thus that all events describing a region are reported
in the proper order. We omit this proof due to space constraints.





Fig. 3. Intersection points detected during region-finding.



As with the algorithm by Nievergelt and Preparata, maintaining the region
list does not asymptotically increase the time spent per event, and thus we can
compute the overlay of two polygonal layers in optimal O(N log N + K) time.
According to Lemma 1 the space requirement can be bounded by O(N α(N)).

3 Extensions to the Overlay Algorithms to Handle 
Real-World Data 

In this section we show how the region-finding algorithms of Section 2 can be
extended to handle real-world configurations. In general, these data sets are
“degenerate” in a geometric sense, i.e., there are many segments starting or
ending at a given node, endpoints may have identical x-coordinates, or segments
from different layers overlap. Section 3.1 describes how to handle such situations,
and in Section 3.2 we discuss how to deal with massive data sets that are too
large to fit completely into main memory.

3.1 Handling Non-general Configurations 

The most frequently occurring non-general configuration consists of multiple
endpoints having the same x-coordinate. Such situations can be handled easily
if we implicitly rotate the data set using a lexicographical order of the
endpoints. Other non-general configurations include multiple segments meeting at
one point, overlapping segments, or endpoints lying on the interior of other
segments. In the following discussion we assume that all geometric computations
in the algorithms can be carried out in a numerically robust way (as for
example described by Brinkmann and Hinrichs or by Bartuschka, Mehlhorn, and
Näher) and thus, that all non-general configurations can be detected
consistently. A recent approach to handle degenerate cases in map overlay has been
presented by Chan and Ng. Their algorithm requires two phases: first all
intersections are computed by some line segment intersection algorithm (thereby
passing the main part of the care for degeneracies to that algorithm), and then
all sorted event points (including intersection points) are processed in a
plane-sweep. Since their algorithm does not improve the O((N + K) log N) time bound
of Nievergelt and Preparata’s algorithm and requires a separate intersection
algorithm we do not consider it in this paper.
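The lexicographical order mentioned at the beginning of this section can be sketched as follows. The example assumes that endpoints are (x, y) tuples and that events are plain tuples; the names oriented and build_events are hypothetical. Comparing points as tuples orders them by x first and by y on ties, which also gives vertical segments a well-defined start point.

```python
def oriented(segment):
    """Return (start, end) with start lexicographically smaller than end."""
    p, q = segment
    # Tuple comparison in Python is lexicographic: x first, then y.
    return (p, q) if p <= q else (q, p)


def build_events(segments):
    """Create start/end events and sort them by (x, y) of the event point."""
    events = []
    for sid, seg in enumerate(segments):
        start, end = oriented(seg)
        events.append((start, "start", sid))
        events.append((end, "end", sid))
    # Equal x-coordinates are resolved by y, which corresponds to an
    # implicit, arbitrarily small rotation of the data set.
    events.sort(key=lambda e: e[0])
    return events


segments = [((2, 5), (2, 1)),  # vertical segment, given top to bottom
            ((0, 0), (2, 3))]
for event in build_events(segments):
    print(event)
```

Here the vertical segment starts at (2, 1) and ends at (2, 5), and the endpoint (2, 3) of the second segment is processed between these two events.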

Figure 4(a) depicts a situation where an end event and an intersection event
coincide at some point p. No matter which of these events is handled first the
algorithm by Nievergelt and Preparata (Section 2.2) does not construct the region
bounded by s1 and t1 and the region bounded by t1 and s2. Kriegel, Brinkhoff,
and Schneider solved this problem by replacing the four event types by two
more general situations: first, all segments ending at a given point are processed
in clockwise order (end event) and then all segments starting at that point are
processed in counterclockwise order (start event). Bend events and intersection
events are handled as a pair of one end and one start event.




Fig. 4. Non-general configurations. 



However, for the map overlay operation there are two more details that
deserve special attention: in the situation of Figure 4(b) we have two overlapping
segments p1p2 and p1p3 of different colors and a red segment passing through
a blue endpoint p4. To avoid topological inconsistencies in the construction and
attribute propagation of the empty region between p1p2 and p1p3 we modify
these two segments such that there are two segments p1p2 and p2p3 by adjusting
the left endpoint of the longer segment p1p3. The resulting unique segment p1p2
needs to store the names of both the blue region above and the red region
below. To detect segments passing through an endpoint, at each event p we locate
the current endpoint in the y-table of the other color and check whether it lies
on an open segment. In this case we split that segment into two new segments
(one ending at p and one starting at p) and proceed as usual. It is easy to see
that these modifications guarantee topological consistency and require only a
constant number of operations per event, thereby not affecting the asymptotic
time complexity.
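The split of a segment at an endpoint lying in its interior can be sketched as follows. The code assumes integer coordinates so that the collinearity test is exact, and it leaves out the lookup in the y-table of the other color; the names on_open_segment and split_at are hypothetical.

```python
def on_open_segment(p, a, b):
    """True if p lies strictly inside segment ab (integer coordinates assumed)."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if cross != 0:
        return False  # p is not collinear with a and b
    dot = (p[0] - a[0]) * (b[0] - a[0]) + (p[1] - a[1]) * (b[1] - a[1])
    length_sq = (b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2
    return 0 < dot < length_sq  # strict bounds exclude both endpoints


def split_at(p, segment):
    """Split segment at p if p lies on the open segment, else keep it."""
    a, b = segment
    if on_open_segment(p, a, b):
        return [(a, p), (p, b)]  # one part ending at p, one starting at p
    return [segment]


print(split_at((2, 2), ((0, 0), (4, 4))))  # [((0, 0), (2, 2)), ((2, 2), (4, 4))]
print(split_at((0, 0), ((0, 0), (4, 4))))  # [((0, 0), (4, 4))]
```

Both functions use a constant number of arithmetic operations, in line with the statement that the modification does not affect the asymptotic time complexity.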



3.2 Handling of Massive Data Sets 

The handling of massive data sets is a well-known problem in GIS and the
primary focus of investigation in the field of external memory algorithms (see
Vitter for a recent survey). In this section we explain how the algorithms for
the overlay of polygonal maps presented above can be used for the map overlay
even when dealing with massive data sets.

As noticed by several authors, real-world data sets from several
application domains seem to obey the so-called “√N-rule”: for a given set of N line
segments and a fixed vertical line L there are at most O(√N) intersections
between the line segments and the line L. As a result, data sets obeying this rule
have at most O(√N) active segments that need to be stored in an internal
memory y-table. The TIGER/Line data set for the entire United States is estimated to
contain no more than 50 million segments, and we have √50,000,000 ≈ 7,000.
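The estimate can be checked with a short calculation. The sketch below assumes a constant factor of 1 in the O(√N) bound, and the figure of 32 bytes per stored segment is a purely hypothetical value chosen for illustration.

```python
import math


def active_bound(n_segments: int) -> int:
    """Integer square root of N (constant factor 1 assumed)."""
    return math.isqrt(n_segments)


def y_table_bytes(n_segments: int, bytes_per_segment: int = 32) -> int:
    """Rough y-table size; bytes_per_segment is a hypothetical value."""
    return active_bound(n_segments) * bytes_per_segment


print(active_bound(50_000_000))   # 7071
print(y_table_bytes(50_000_000))  # 226272
```

Even with a considerably larger constant factor the active segments of such a data set would occupy only a small fraction of main memory.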

Both overlay algorithms presented in Section O allow for space-efhcient stor- 
age of the computed overlay. As soon as a region of the resulting layer has been 
swept completely it does not need to be present in internal memory anymore 
and can be transferred to disk. As long as no resulting region contains too many 
boundary segments, there is a fair chance that even for massive data sets both 
active segments and active regions fit into main memory. In contrast, if we aim 
at constructing a network oriented layer for the resulting partition we are 
forced to keep the complete data structure in main memory. 

Conceptually, the x-queue can be thought of as an input stream that feeds 
the algorithm, and due to the size of the input data set this stream will be 
resident on secondary storage. To optimize access to this x-queue its elements 
should be accessed sequentially, i.e., one should avoid random access as needed 
for update operations. The basic algorithm by Nievergelt and Preparata does 
not fulfil this property since intersection events are detected while sweeping and 
need to be inserted in the x-queue. However, by applying Brown’s modification 
and thus storing at most one intersection event per active segment we can avoid 
such updates during the sweep. Instead, we store all detected intersection events 
in an internal memory priority queue. Whenever we find a new intersection 
event p, say an intersection between the segments s and t, we check whether 
the events possibly associated with s and t are closer to the sweep-line than p. 
If any of these events lies farther away from the sweep-line than p, we remove 
this event from the internal memory priority queue and insert p instead. At each 
transition the sweep algorithm has to choose the next event to be processed by 
comparing the first element in the x-queue and the first element in the priority 
queue storing the intersection events. This internal memory priority queue can 
be maintained in $O((N+K) \log N)$ time during the complete sweep and occupies 
$O(\sqrt{N})$ space under the assumption that the data sets obey the 
“$\sqrt{N}$-rule”. Our 
modification of Chan’s algorithm does not require such an additional internal 
memory structure since there is no need for updating the x-queue during the 
sweep. 
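
The event selection described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the class name `EventSource`, the event format `(x, label)` and the use of `heapq` are all hypothetical, and the bookkeeping of Brown's modification (at most one stored intersection per active segment) is deliberately left out.

```python
import heapq
from typing import Iterable, Iterator, List, Optional, Tuple

Event = Tuple[float, str]  # (x-coordinate, label); hypothetical event format

class EventSource:
    """Merges a sequentially read x-queue with an in-memory heap of intersections."""

    def __init__(self, x_queue: Iterable[Event]) -> None:
        self._stream: Iterator[Event] = iter(x_queue)   # read strictly front to back
        self._head: Optional[Event] = next(self._stream, None)
        self._heap: List[Event] = []                    # internal memory priority queue

    def push_intersection(self, event: Event) -> None:
        """Store a detected intersection event without touching the x-queue."""
        heapq.heappush(self._heap, event)

    def pop(self) -> Optional[Event]:
        """Return the next event in x-order, or None if both sources are empty."""
        if self._head is None and not self._heap:
            return None
        take_heap = self._head is None or (
            bool(self._heap) and self._heap[0] < self._head
        )
        if take_heap:
            return heapq.heappop(self._heap)
        event, self._head = self._head, next(self._stream, None)
        return event

# Usage: endpoints come from the stream, the intersection is found while sweeping.
source = EventSource([(1.0, "start a"), (4.0, "end a"), (6.0, "end b")])
source.push_intersection((2.5, "a x b"))
order = []
event = source.pop()
while event is not None:
    order.append(event)
    event = source.pop()
print(order)
```

The x-queue is only ever advanced, never updated, which is the property needed for sequential access to secondary storage.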

If we use the extension proposed in Section 3.1 for handling non-general 
configurations, we are forced to modify endpoints of active segments or to split 
segments which triggers the need for updating the x-queue. This situation can be 
handled by using an internal memory priority queue similar to the one proposed 
above. If we encounter two overlapping segments, we adjust the left endpoint 
of the longer one and insert it in the internal memory event queue. Since both 
original segments are active we simply move this modified segment from the 
y-table to the priority queue without increasing internal memory space require- 
ments. For each segment split due to handling non-general configurations, we can 
charge one endpoint of the active segments coincident with the split point for 
the new segment inserted into the priority queue. Since each active segment can 
contribute to at most two splits the internal memory space requirement remains 
asymptotically the same. Synchronizing the x-queue and the priority queue is 
done as described above. 



4 Empirical Results 

In this section we discuss our implementation of the algorithms described in 
Section 2 and Section 3 and the results of experiments with artificial and real- 
world data sets. Our focus is on the comparison of algorithms having different 
asymptotic complexities. The comparison is based on real-world situations and 
artificial situations simulating special cases. According to Section 3 we expect 
identical I/O cost for both algorithms. Hence, we ran the algorithms on data 
sets ranging from small to quite large, comparing the absolute internal memory 
running times of both algorithms. 

The algorithms have been implemented in C++, and all experiments were 
conducted on a Sun Ultra 2 workstation running Solaris 2.5.1. The workstation 
was equipped with 640 MB main memory to ensure that, even without 
implementing the concepts from Section 3.2, all operations are performed without 
causing page faults or involving I/O operations. For a more detailed description 
of the implementation we refer the reader to Giesen. 



4.1 Artificial Data Sets 

The first series of experiments concentrated on the behavior of both algorithms 
for characteristic values of K. To this end we generated three classes of input 
data with $K = 0$, $K \in O(N)$, and $K \in O(N^2)$ intersection points, respectively, 
as shown in Figure 5. The results of the experiments are presented in Table 1. 
As a careful analysis shows, the algorithm from Section 2.4 performs three 
times as many intersection tests as the algorithm by Nievergelt and Preparata, 
resulting in a slow-down by a factor of three. 

Fig. 5. Artificial data sets: (a) $K = 0$, (b) $K \in O(N)$, (c) $K \in O(N^2)$. 



For $K \in O(N)$ the more efficient reporting of our algorithm cancels out 
that drawback, and for $K \in O(N^2)$ we see a clear superiority of our algorithm 
reflecting the theoretical advantage of $O(N \log N + N^2) = O(N^2)$ versus 
$O((N + N^2) \log N) = O(N^2 \log N)$. The speed-up varies between factors of 10.4 
and 17.8, and the more segments (and thus intersection points) are involved the 
more our algorithm improves over Nievergelt and Preparata’s algorithm. 

Table 1. Overlaying artificial maps (time in seconds). 



4.2 Real-World Data Sets 

After having examined the behavior of the two algorithms under varying param- 
eters N and K, we tested the described algorithms for real-world data sets. We 
ran a series of experiments with polygonal maps of cities and counties from the 
state of North-Rhine Westphalia (NRW) and Germany (GER). Vectorized maps 
obtained from different distributors are likely to exhibit different digitizing 
accuracies, and as can be seen from Figure 6 such differences bring about so-called 
“sliver polygons”. 

Fig. 6. Real-world data set: City of Münster, Germany. 



Detecting and post-processing sliver polygons is an important issue in map 
overlay, and there are specialized methods for this task depending on the appli- 
cation at hand. At this point we only mention that the region-finding algorithms 
can be extended in a straightforward way to serve as a filter for sliver polygons 
because they produce all regions while sweeping. 

To simulate additional sliver polygons we performed experiments where we 
overlaid a data set with a slightly translated second data set. The running 
times for overlaying real-world maps are summarized in Table 2. 

Table 2. Overlaying real-world maps (time in seconds, “+” indicates translated 
data set). 



These figures show an advantage for Nievergelt and Preparata’s algorithm 
by a factor of up to 2.3. This result could have been anticipated from the results 
of Section 4.1, however, since there are fewer than linearly many intersection 
points in all of these experiments. In analogy to the above observations, our 
proposed algorithm improves with increasing number of segments involved. 

Summarizing the experiments from both this section and Section 4.1, we see 
an advantage for the conventional method by Nievergelt and Preparata as long 
as there are at most linearly many intersection points. For large data sets this 
advantage becomes smaller, and there is a clear superiority for our proposed 
algorithm when dealing with data sets producing more than linearly many 
intersection points. 



5 Conclusions 

In this paper we have examined algorithms for performing map overlay operations 
with massive real-world data sets. We have proposed to modify plane-sweep 
algorithms to cope with such data sets. It is obvious that such algorithms may 
also be used to implement the refinement step of a spatial join efficiently. 
However, this approach is not useful if the filter step produces only few candidates 
which must be checked in the refinement step, since the plane-sweep algorithms 
may need to store these candidates on disk for sorting. We have not yet 
addressed this problem in our research, but obviously it is not efficient to use a 
plane-sweep algorithm if each polygon of the two input maps participates 
in at most one candidate pair. 

Our experiments have shown that the modified algorithm of Chan performs 
better than the algorithm of Nievergelt and Preparata if there are more than 
linearly many intersection points. On the other hand, the algorithm of Nievergelt 
and Preparata performs better than the algorithm of Chan if there are 
sublinearly many intersection points. 

In our future work we will focus on the realization of the methods proposed 
in Section 3 and on the combination with existing algorithms for the filter step 
of the spatial join operation. This also includes a comparison of using plane- 
sweep algorithms in the refinement step with a simple checking of each pair 
reported by the filter step for an intersection. 
Abstract. We provide an evaluation of query execution plans (QEP) 
in the case of queries with one or two spatial joins. The QEPs assume 
R*-tree indexed relations and use a common set of spatial join algorithms, 
among which one is a novel extension of a strategy based on an 
on-the-fly index creation prior to the join with another indexed relation. 
A common platform is used on which a set of spatial access methods and 
join algorithms are available. The QEPs are implemented with a general 
iterator-based spatial query processor, allowing for pipelined QEP exe- 
cution, thus minimizing memory space required for intermediate results. 



1 Introduction 



It is well known that the application of Database Management Systems (DBMS) 
join techniques, such as sort-merge, scan and index, hash join and join indices, 
to the context of spatial data is not straightforward. This is due to the fact 
that these techniques, as well as B-tree-based techniques, rely heavily on the 
domain ordering of the relational attributes, an ordering which does not exist in 
the case of multi-dimensional data. 

A large number of spatial access methods (SAM) have been proposed in the 
past fifteen years, as well as a number of spatial join algorithms, including 
the adaptation of well-known join strategies to the particular requirements of 
spatial joins. 

These strategies have been validated through experiments on different platforms, 
with various methodologies, datasets and implementation choices. The 
lack of a commonly shared performance methodology and benchmarking makes 
a fair comparison among these numerous techniques difficult. 

The methodology and evaluation are crucial not only for the choice of a few 
efficient spatial join algorithms but also for the optimization of complex queries 
involving several joins in sequence (multi-way joins). In the latter more general 
case, the generation and evaluation of complex query execution plans (QEP) is 
central to optimization. Only a few papers study the systematic optimization of 
spatial queries containing multi-way joins. 

The objective of this paper is twofold: (i) to provide a common framework 
and evaluation platform for spatial query processing, and (ii) to use it to exper- 
imentally evaluate spatial join processing strategies. 

A complex spatial query can be translated into a QEP with some physical 
operations such as data access (sequential or through an index), spatial selection, 
spatial join, sorting, etc. A QEP is then represented as a binary tree in which 
leaves are either indices or data files and internal nodes are physical operators. 

We use as a model for spatial query processing the pipelined execution of 
such QEPs with each node (operation) being implemented as an iterator [Gra93]. 
This execution model provides a sound framework: it encompasses spatial and 
non-spatial queries, and allows simple queries and large complex queries 
involving several consecutive joins to be considered in a uniform setting. Whenever 
possible, records are processed one-at-a-time and transferred from one node to the 
following, thereby avoiding the storage of intermediate results. 

Such an execution model is not only useful to represent and evaluate complex 
queries, but also to specify and make a fair comparison of simple ones. Indeed, 
consider a query including a single spatial join between two relations. The join 
output is, unfortunately, algorithm dependent. Some algorithms provide as an 
output a set of pairs of record identifiers (one per relation), others, such as the 
so-called Scan and Index (SAI) strategy provide a set in which each element 
is composed of a record (of the first relation) and the identifier of a record in 
the second relation. Then to complete the join, the former case requires two 
data accesses, while only one data access is necessary in the latter case. This 
example illustrates the necessity for a consistent comparison framework. The 
above execution model provides such a framework. 

Another advantage of this execution model is that it allows two QEPs to be 
compared not only on their time performance but also on their memory space 
requirement. Some operations cannot be pipelined, e.g., sorting an intermediate 
result, and require the completion of an operation before starting the following 
operation. Such operators, denoted blocking iterators in this paper, are usually 
memory-demanding and raise some complex issues related to the allocation of 
the available memory among the nodes of a QEP. In order to make a fair compar- 
ison between several QEPs, we shall always assign the same amount of memory 
to each QEP during an experiment. 

The study performed in this paper is a first contribution to the evaluation 
of complex spatial queries that may involve several joins in sequence (multi-way 
joins). Based on the above model we evaluate queries with one or two spatial 
joins. We make the following assumptions: (i) all relations in the database are 
indexed on their spatial attribute, (ii) we choose the R*-tree [BKSS90] for all 
indices, (iii) the index is always used for query optimization. While the first 
assumption is natural, the second one is restrictive. Indeed, while the R*-tree 
is an efficient SAM, there exist a number of other data structures that deserve 
some attention, among which it is worth noting the grid based structures derived 
from the grid file [NHS84]. The third assumption is also restrictive since it does 
not take into account the proposal of several techniques for joining non-indexed 
relations. 

The comparison of QEPs as defined above has been done on a common 
general platform developed for spatial query processing evaluation. This platform 
provides basic I/O and buffer management, a set of representative SAMs, a 
library of spatial operations, and implements a spatial query processor according 
to the above iterator model using as nodes the SAMs and spatial operations 
available. 

The rest of the paper is organized as follows. Section 2 briefly surveys the 
various spatial join techniques proposed in the literature and summarizes related 
work. The detailed architecture of the query processor is presented in Section 3. 
Section 4 deals with our choices for spatial join processing and the generated 
QEPs. Section 5 reports on the experiment, the datasets chosen for the evaluation 
and the results of the performance evaluation. Some concluding remarks are 
given in Section 6. 

2 Background and Related Work 

We assume each relation has a spatial attribute. The spatial join between 
relations $R_1$ and $R_2$ constructs the pairs of tuples from $R_1 \times R_2$ whose 
spatial attributes satisfy a spatial predicate. We shall restrict this study to intersect 
joins, also referred to as overlap joins. Usually, the value of each spatial attribute 
is a pair (MBR, spatial object representation), where MBR is the minimum 
bounding rectangle of the spatial object. Intersect spatial joins are usually computed 
in two steps. In the filter step the tuples whose MBRs overlap are selected. 
For each pair that passes the filter step, in the refinement step the spatial object 
representations are retrieved and the spatial predicate is checked on these spatial 
representations [BKSS94]. 
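
As an illustration of the filter step, the following Python sketch tests MBR overlap and selects candidate pairs by brute force. The names `MBR`, `mbr_overlap` and `filter_step` are hypothetical, and rectangles are assumed to be closed, so that rectangles which merely touch count as overlapping. The refinement step on the exact geometry is not shown.

```python
from typing import List, NamedTuple, Tuple

class MBR(NamedTuple):
    """Minimum bounding rectangle with closed boundaries (an assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def mbr_overlap(a: MBR, b: MBR) -> bool:
    """True if the two rectangles share at least one point."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)

def filter_step(r: List[MBR], s: List[MBR]) -> List[Tuple[int, int]]:
    """Candidate pairs (i, j) whose MBRs overlap; brute force for clarity."""
    return [(i, j)
            for i, a in enumerate(r)
            for j, b in enumerate(s)
            if mbr_overlap(a, b)]

r = [MBR(0, 0, 2, 2), MBR(5, 5, 6, 6)]
s = [MBR(1, 1, 3, 3), MBR(10, 10, 11, 11)]
print(filter_step(r, s))  # candidate pairs passed on to the refinement step
```

A pair that passes this test is only a candidate: two objects may have overlapping MBRs without intersecting, which is why the refinement step is needed.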

Many experiments only consider the filter step. This might be misleading for 
the following reasons: first, one cannot fairly compare two algorithms which do 
not yield the same result (for instance, if the SAI strategy is used, at the end 
of the filter step part of the record value has already been accessed, which is 
useful for the refinement step, while this is not true with the STT strategy 
[BKS93]); second, by considering only the filter step, one ignores its interactions 
with the refinement step, for instance in terms of memory requirements. We shall 
include in our experiments all the operations necessary to retrieve data from 
disk, whether this data access is for the filter step, or the refinement step. Only 
the evaluation of the computational geometry algorithm on the exact spatial 
representation, which is equivalent whatever the join strategy, will be excluded. 

SAMs can be roughly classified into two categories: 

- Space-driven structures, among which grids and quadtrees are very popular, 
  partition the tuples according to some spatial scheme independent from the 
  spatial data distribution of the indexed relation. 
- Data-driven structures, on the other hand, adapt to the spatial data distribution 
  of tuples. The most popular SAM of this category is the R-tree. 
  The R+-tree, the R*-tree [BKSS90] and the X-tree are 
  improved versions of the R-tree. These dynamic SAMs maintain their structure 
  on each insertion/deletion. In the case of static collections which are 
  not often updated, packing algorithms [RL85] [KF93] [LEL97] build optimal 
  R-trees, called packed R-trees. 

Spatial join algorithms can be classified into three categories depending on 
whether each relation is indexed or not. 

1. no index: for the case where no index exists on any relation, several 
partitioning techniques have been proposed which partition the tuples into 
buckets and then use either hash-based techniques or sweep-line techniques 
[APR+98]. 

2. two indices: when both relations are indexed, the algorithms that have been 
proposed depend on the SAM used. The first known work on spatial 
joins proposes a 1-dimensional ordering of spatial objects, which are then 
indexed on their rank in a B-tree and merge-joined. [Gün93] was the first proposal 
of an algorithm called Synchronized Tree Traversal (STT) which adapts 
to a large family of spatial predicates and tree structures. The STT algorithm 
of [BKS93] is the most popular one because of its efficiency. Proposed 
independently from [Gün93], it uses R*-trees and an efficient depth-first tree 
traversal of both trees for intersection joins. The algorithm is sketched below. 

Algorithm STT (Node N1, Node N2) 
begin 
    for all (e1 in N1) 
        for all (e2 in N2) such that e1.MBR ∩ e2.MBR ≠ ∅ 
            if (the leaf level is reached) then 
                output (e1, e2) 
            else 
                N1' = readPage (e1.pageID); N2' = readPage (e2.pageID); 
                STT (N1', N2') 
            endif 
end 



Advanced variants of the algorithm apply some local optimization in order 
to reduce the CPU and I/O costs. In particular, when joining two nodes, the 
overlapping of entries is computed using a plane-sweeping technique instead 
of the brute-force nested loop algorithm shown above. The MBRs of each 
node are sorted on the x-coordinate, and a merge-like algorithm is carried 
out. This is shown to significantly reduce the number of intersection tests. 

3. single index : when only one index exists, the simplest strategy is the Scan 
And Index (SAI) strategy, a variant of the nested loop algorithm which scans 
the non-indexed relation and for each tuple r delivers to the index of the other 



290 Apostolos Papadopoulos, Philippe Rigaux, and Michel Scholl 

relation a window query with r.MBR as an argument. The high efficiency 
of the STT algorithm suggests that an “on-the-fly” construction of a second 
index, followed by STT, could compete with SAI. This idea has inspired a 
join algorithm which constructs a seeded tree on the non-indexed 
relation, i.e., an R-tree whose first levels match exactly those of the existing 
R-tree. It is shown that the strategy outperforms SAI and the naive on-the- 
fly construction of an R-tree with dynamic insertion. An improvement of 
this idea is the SISJ algorithm. An alternative is to build a packed 
R-tree by using bulk-load insertions [LEL97]. Such constructions optimize 
the STT algorithm since they reduce the set of nodes to be compared during 
traversal. These algorithms are examples of strategies referred to as Build- 
and-Match strategies in the sequel. 
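
For illustration, the nested-loop form of the STT pseudocode given in the second category can be written as a small runnable Python sketch. It is only a sketch under stated assumptions: the `Node` and `Entry` classes are hypothetical, child nodes are held in memory instead of being fetched with readPage, and both trees are assumed to have the same height, as in the pseudocode.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class Entry:
    mbr: Rect
    child: Union["Node", int]  # child node, or a record id at the leaf level

@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)

def intersects(a: Rect, b: Rect) -> bool:
    """Closed-rectangle overlap test on two MBRs."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def stt(n1: Node, n2: Node, out: List[Tuple[int, int]]) -> None:
    """Synchronized depth-first traversal of two trees of equal height."""
    for e1 in n1.entries:
        for e2 in n2.entries:
            if not intersects(e1.mbr, e2.mbr):
                continue
            if n1.is_leaf and n2.is_leaf:
                out.append((e1.child, e2.child))   # pair of record ids
            else:
                stt(e1.child, e2.child, out)       # stands in for readPage

# Two small trees of height 2.
leaf_a = Node(True, [Entry((0, 0, 1, 1), 1), Entry((2, 2, 3, 3), 2)])
root_a = Node(False, [Entry((0, 0, 3, 3), leaf_a)])
leaf_b = Node(True, [Entry((0.5, 0.5, 2.5, 2.5), 7)])
root_b = Node(False, [Entry((0.5, 0.5, 2.5, 2.5), leaf_b)])
pairs: List[Tuple[int, int]] = []
stt(root_a, root_b, pairs)
print(pairs)
```

The plane-sweep variant mentioned for the second category would replace the two nested loops by a merge over entries sorted on one coordinate; it is not shown here.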

The complexity of the spatial join operation, the variety of techniques and 
the numerous parameters involved in a spatial join make it extremely difficult to 
compare the proposals briefly sketched above. 

Only a few attempts have been made toward a systematic comparison of spatial 
join strategies. [GOP+98] is a preliminary attempt to integrate in a common 
platform the evaluation of spatial query processing strategies. It proposes a web- 
based rectangle generator and gives first results on the comparison of three join 
strategies: nested loop, SAI and STT. The major limit of this experiment is that 
it is built on top of an existing DBMS. This not only limits the robustness of 
the results but also renders impossible or inefficient the implementation of complex 
strategies, the tuning of numerous parameters and a precise analysis. 

[MP99] is the first study on multi-way spatial joins. It proposes an iterator 
pipelined execution of QEPs for multi-way spatial joins, with three 
join algorithms, one per category, namely STT, SISJ, and a spatial hash-join 
technique. An analytical model predicts the cost of QEPs, and a dynamic 
programming algorithm for choosing the optimal QEP is proposed. The query 
optimization model is validated through experimental evaluation. 

The modeling of QEPs involving one or several joins in the study reported 
below follows the same pipelined iterator based approach. This execution model 
is implemented on a platform common to all evaluations. This platform allows 
for fine tuning of parameters impacting the strategies' performance and is general 
enough to implement and evaluate any complex QEP: it is not limited to spatial 
joins. Last but not least, such a model and its implementation allow for various 
implementation details generally absent from evaluations reported in the litera- 
ture. The relation access after a join is an example of implementation “detail” 
which accounts for an extremely significant part of the query response time as 
shown below. 



3 The Query Processor 

The platform has been implemented in C++ and runs on top of UNIX or 
Windows NT. Its architecture is shown in Fig. 1. It is composed of a database and 
three modules which implement some of the standard low-level services of a 
centralized DBMS. 

The database is a set of binary files. Each binary file either stores a data file 
or a SAM. A data file is a sequential list of records. A SAM or index refers to 
records in an indexed data file through record identifiers. The lowest level module 
is the I/O module, which is in charge of reading (writing) pages from (to) the 
disk. The second module manages buffers of pages fetched (flushed) from (to) 
the disk through the I/O module. On top of the buffer management module is 
the query processing module, which supports spatial queries. 



Fig. 1. Platform architecture 



Database and Disk Access 

The database (data files and SAM) is stored in binary files divided into 
pages whose size is chosen at database creation. A page is structured as a header 
followed by an array of fixed-size records which can be either data records or 
index entries. The header and record sizes depend on the file. By knowing the 
record size, one can compute the number of records per page and the data file 
size. 

Each page is uniquely identified by its PageID (4 bytes). A record is identified 
by a RecordID (8 bytes) which is a pair [PageID, offset] where offset 
denotes the record offset in the page. 
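
The page and record addressing described above can be illustrated with a short Python sketch. The byte layout is an assumption: the text only states that a PageID takes 4 bytes and a RecordID 8 bytes, so the offset is taken here to be a 4-byte unsigned integer in little-endian order. The function names and the example page, header and record sizes are hypothetical.

```python
import struct
from typing import Tuple

# Assumed layout: 4-byte PageID followed by a 4-byte offset, little-endian.
RECORD_ID_FORMAT = "<II"

def pack_record_id(page_id: int, offset: int) -> bytes:
    """Encode a RecordID as the pair [PageID, offset] in 8 bytes."""
    return struct.pack(RECORD_ID_FORMAT, page_id, offset)

def unpack_record_id(raw: bytes) -> Tuple[int, int]:
    """Decode an 8-byte RecordID back into (PageID, offset)."""
    return struct.unpack(RECORD_ID_FORMAT, raw)

def records_per_page(page_size: int, header_size: int, record_size: int) -> int:
    """Number of fixed-size records that fit in a page after its header."""
    return (page_size - header_size) // record_size

raw = pack_record_id(7, 3)
print(len(raw), unpack_record_id(raw))
print(records_per_page(page_size=4096, header_size=16, record_size=100))
```

With the number of records per page known, the data file size follows by dividing the number of records by this value and rounding up.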

1. Data files are sequential collections of pages storing data records. In the cur- 
rent setting, a record basically is a binary representation of a spatial object. 
From the query processing point of view, the most important information 
stored in a record is its geometric key, which is, throughout this experiment, 
its MBR. A data file can either be accessed sequentially (FileScan in the 
sequel), or by RecordID (RowAccess in the sequel). It is important to note 
that the data files are not clustered on their geometric representation (i.e., 
objects close in space are not necessarily close on disk). 
2. SAMs are structured collections of IndexEntry. An index entry is a pair 
[Key, RecordID] , where Key denotes the geometric key (here the MBR) 
and RecordID identifies a record in the indexed data file. The currently 
implemented SAMs are a grid file, an R-tree, an R*-tree and several packed 
R-trees. In the sequel, each datafile is indexed with an R*-tree. 



Buffer Management 

The buffer manager handles one or several buffer pools: a data file or index 
(SAM) is assigned to one buffer pool, but a buffer pool can handle several indices. 
This allows much flexibility when assigning memory to the different parts of a 
query execution plan. The buffer pool is a constant-size cache with LRU or FIFO 
replacement policy (LRU by default). Pages can be pinned in memory. A pinned 
page is never flushed until it is unpinned. 
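
A buffer pool with LRU replacement and pinning, as described above, might look as follows in Python. This is a hypothetical sketch rather than the platform's code: the class and method names are assumptions, the page loader is passed in as a function, and writing dirty pages back to disk is omitted.

```python
from collections import OrderedDict
from typing import Callable, Set

class BufferPool:
    """Constant-size page cache with LRU replacement and pinned pages."""

    def __init__(self, capacity: int, read_page: Callable[[int], bytes]) -> None:
        self.capacity = capacity
        self.read_page = read_page                   # stands in for the I/O module
        self.pages: "OrderedDict[int, bytes]" = OrderedDict()
        self.pinned: Set[int] = set()
        self.faults = 0                              # number of page faults so far

    def get(self, page_id: int) -> bytes:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)          # mark as most recently used
            return self.pages[page_id]
        self.faults += 1
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page_id] = self.read_page(page_id)
        return self.pages[page_id]

    def _evict(self) -> None:
        # The least recently used page that is not pinned is the victim.
        victim = next((p for p in self.pages if p not in self.pinned), None)
        if victim is None:
            raise RuntimeError("all pages in the pool are pinned")
        del self.pages[victim]

    def pin(self, page_id: int) -> None:
        self.get(page_id)
        self.pinned.add(page_id)

    def unpin(self, page_id: int) -> None:
        self.pinned.discard(page_id)

pool = BufferPool(capacity=2, read_page=lambda pid: b"page %d" % pid)
pool.pin(1)      # page 1 stays resident
pool.get(2)
pool.get(3)      # evicts page 2, although page 1 is older
print(sorted(pool.pages), pool.faults)
```

A FIFO policy would differ only in `get`, which would no longer move a page to the end of the queue on a hit.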

Currently, all algorithms requiring page accesses uniformly access these pages 
through the interface provided by the buffer manager. In particular, spatial join 
algorithms share this module and therefore cannot rely on a tailored main memory 
management or a specialized I/O policy unless it has already been implemented 
in this module. 

Query Processing Module 

One of the important design choices for the platform is to allow for any ex- 
perimental evaluation of query execution plans (QEP) as generated by database 
query optimizers with an algebraic view of query languages. During optimization, 
a query is transformed into a QEP represented as a binary tree which captures 
the order in which a sequence of physical algebraic operations are going to be 
executed. The leaves represent data files or indices, internal nodes represent alge- 
braic operations and edges represent dataflows between operations. Examples of 
algebraic operations include data access (FileScan or RowAccess), spatial selections, 
spatial joins, etc. As mentioned above we use as a common framework for 
query execution a demand-driven process with iterator functions [Gra93]. Each 
node (operation) is an iterator. This allows for a pipelined execution of multiple 
operations, thereby minimizing the system resources (memory space) required 
for intermediate results: data consumed by an iterator, say I, is generated by 
its son(s) iterator(s), say J. Records are produced and consumed one-at-a-time. 
Iterator I asks iterator J for a record. Therefore the intermediate result of an 
operation is not stored in such pipelined operations except for some specific 
iterators called blocking iterators, such as sorting. 

This design allows for simple QEP creation by “assembling” iterators together. 
Consider the QEP for a spatial join $R \bowtie S$ implemented by the simple 
scan-and-index (SAI) strategy (Fig. 2a): scan R (FileScan); for each tuple r in 
R, execute a window query on index $I_S$ with key r.MBR. This gives a record 
ID RecordID2¹. Finally read the record with id RecordID2 in S (RowAccess). 
The refinement step, not represented in the figure, can then be performed on the 
exact spatial object available in Record1 and Record2. 

¹ As a matter of fact each index (leaf) access returns an IndexEntry, i.e., a pair [MBR, 
RecordID]. For the sake of simplicity, we do not show this MBR on the figures. 
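
The assembly of iterators for the scan-and-index plan can be sketched with Python generators, which are demand-driven in the same sense: a record is produced only when the consumer asks for it. All names below are hypothetical, data files are modelled as dictionaries keyed by record id, and a linear scan over index entries stands in for the R*-tree window query.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Rect = Tuple[float, float, float, float]     # (xmin, ymin, xmax, ymax)
Record = Tuple[str, Rect]                    # (name, MBR); assumed record layout
IndexEntry = Tuple[Rect, int]                # [MBR, RecordID]

def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def file_scan(data_file: Dict[int, Record]) -> Iterator[Record]:
    """FileScan: deliver the records of a data file sequentially."""
    for record_id in sorted(data_file):
        yield data_file[record_id]

def window_query(index: List[IndexEntry], window: Rect) -> Iterator[IndexEntry]:
    """Stand-in for an index window query (linear scan, for illustration only)."""
    return (entry for entry in index if overlaps(entry[0], window))

def sai_join(outer: Iterable[Record],
             index: List[IndexEntry]) -> Iterator[Tuple[Record, int]]:
    """SAI: one window query per outer record, yielding (Record1, RecordID2)."""
    for record in outer:
        for _mbr, record_id in window_query(index, record[1]):
            yield record, record_id

def row_access(pairs: Iterable[Tuple[Record, int]],
               data_file: Dict[int, Record]) -> Iterator[Tuple[Record, Record]]:
    """RowAccess: replace RecordID2 by the record it identifies."""
    for record1, record_id2 in pairs:
        yield record1, data_file[record_id2]

R = {0: ("r0", (0, 0, 2, 2)), 1: ("r1", (5, 5, 6, 6))}
S = {10: ("s0", (1, 1, 3, 3)), 11: ("s1", (5.5, 5.5, 7, 7)), 12: ("s2", (9, 9, 10, 10))}
index_S = [(mbr, rid) for rid, (_name, mbr) in S.items()]

plan = row_access(sai_join(file_scan(R), index_S), S)   # fully pipelined QEP
print([(r1[0], r2[0]) for r1, r2 in plan])
```

Nothing is computed when `plan` is built; each call for the next result pulls one record through the whole chain, so no intermediate result is stored.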




Fig. 2. Query execution plans: (a) a scan-and-index join; (b) a build-and-match join. 



This is a fully pipelined QEP: therefore the response time (e.g., the time to 
get the first record) is minimal. It is sometimes necessary to introduce blocking 
iterators in a QEP which require the consumption of all input data before any 
output is possible. Then significant memory is necessary for intermediate results. 

As an example, one can introduce a Sort blocking iterator in the QEP of 
Fig. 2a in order to sort the data flow output by the SAI join on the PageID of 
the RecordID2 component. This allows each page of S to be accessed only once 
instead of issuing random reads to get the S records, which might lead to several 
accesses to the same page. However, no record can be delivered to the user before 
the join is completely processed. 

As a more complex (and realistic) example, consider the QEP of Fig. 2b. It 
implements a different join strategy: an index already exists on S, another one 
is built on the fly on R (iterator Build²) and an STT join is executed. Such a 
join delivers pairs of RecordID, hence two further RowAccess, one per relation, 
are necessary to complete the query (refinement step). Build is blocking: the 
join cannot be started before index $I_R$ has been completely built. In addition, 
the maximal amount of available memory should be assigned to this iterator to 
avoid, as much as possible, flushing pages to disk during index construction. 

It may happen that a QEP relies on several blocking iterators. In that case 
the management of memory is an important issue. Consider the QEP of Fig. 2b. 
The STT node delivers pairs of RecordID, $[r_i, s_i]$, which resembles the join index 
upon non-clustered data, as described in [Val87]. The naive strategy depicted 
in Fig. 2b alternates random accesses on the data files R and S; then the same 
page (either in R or S) will be accessed several times, which leads to a large 
number of page faults. The following preprocessing algorithm, denoted segmented 
sort (SegSort) in the sequel, has been proposed: (1) allocate a buffer 
of size B; (2) compute the number n of pairs (Record1, RecordID2) which can 
fit in B; (3) load the buffer with n pairs (RecordID1, RecordID2); (4) sort on 
RecordID1; (5) access relation R and load records from R (now the buffer is 
full); (6) sort on RecordID2; load records from S, one at a time, and perform 
the refinement step. Repeat from step (3) until the source (STT join in that 
case) is exhausted. Hence, this strategy, by ordering the pairs of records to be 
accessed, saves numerous page faults. 

² R can be an intermediate result delivered by a sub-QEP. 
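
The steps of the segmented sort can be sketched in Python as follows. This is an illustrative sketch only: the names `DataFile` and `seg_sort` are hypothetical, the batch size n of steps (1) and (2) is passed in directly instead of being derived from a buffer size B, a RecordID is modelled as a pair (page, offset), and each toy data file keeps a single page in memory so that page reads can be counted.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple

RecordID = Tuple[int, int]   # (PageID, offset)

class DataFile:
    """Toy data file with a one-page buffer that counts page reads."""

    def __init__(self, records: Dict[RecordID, str]) -> None:
        self.records = records
        self.current_page = None
        self.page_reads = 0

    def row_access(self, record_id: RecordID) -> str:
        page_id, _offset = record_id
        if page_id != self.current_page:     # page not in the buffer: read it
            self.current_page = page_id
            self.page_reads += 1
        return self.records[record_id]

def seg_sort(pairs: Iterable[Tuple[RecordID, RecordID]],
             file_r: DataFile, file_s: DataFile,
             n: int) -> Iterator[Tuple[str, str]]:
    """Process (RecordID1, RecordID2) pairs in sorted batches of n pairs."""
    source = iter(pairs)
    while True:
        batch = list(islice(source, n))                  # step (3)
        if not batch:
            return
        batch.sort(key=lambda pair: pair[0])             # step (4)
        loaded = [(file_r.row_access(rid1), rid2)        # step (5)
                  for rid1, rid2 in batch]
        loaded.sort(key=lambda pair: pair[1])            # step (6)
        for record1, rid2 in loaded:
            yield record1, file_s.row_access(rid2)       # ready for refinement

R = DataFile({(p, o): f"r{p}{o}" for p in range(2) for o in range(2)})
S = DataFile({(p, o): f"s{p}{o}" for p in range(2) for o in range(2)})
join_output = [((1, 0), (0, 0)), ((0, 0), (1, 0)), ((1, 1), (0, 1)), ((0, 1), (1, 1))]
print(list(seg_sort(join_output, R, S, n=4)))
print(R.page_reads, S.page_reads)
```

In this small example each file is read page by page exactly once, whereas accessing the pairs in the order delivered by the join would switch pages on every access.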

The resulting QEP includes two blocking iterators (Build and SegSort) between 
which the available buffer memory must be split. Basically there are two 
strategies for memory allocation for such QEPs: 

1. Flush intermediate results. This is the simplest solution: the total buffer 
space M allocated to the QEP is assigned to the Build iterator, the result of 
the join (STT) is flushed onto disk and the total space M is then reused for 
SegSort. The price to be paid is a possibly very large amount of write and 
read operations onto (from) disk for intermediate results. 

2. Split memory among iterators and avoid intermediate materialization. Each
of the iterators of the QEP is assigned part of the global memory space M.
Then intermediate results are kept in memory as much as possible but less
memory is available for each iterator |br^2H1 ■ 

4 Spatial Join Query Processing 

Using the above platform, our objective is to experimentally evaluate strategies 
for queries involving one or several spatial joins in sequence. 

Fig. 3 illustrates two possible QEPs for processing the query
$R_1 \bowtie R_2 \bowtie \cdots \bowtie R_n$ using indices
$I_1, I_2, \ldots, I_n$. Both assume that (i) the optimizer uses existing
spatial indices as much as possible when generating QEPs and (ii) the n-way
join is first evaluated on the MBRs (filter step) and then on the exact
geometry: an n-way join is performed on a limited number of tuples of the
cartesian product $R_1 \times R_2 \times \cdots \times R_n$ (refinement step,
requiring n row accesses). Both QEPs are left-deep trees [Gra93]. In such trees
the right operand of a join is always an index, as is the left operand of the
left-most node. Another approach, not investigated here, would consist of an
n-way STT, i.e., a synchronized traversal of n R-trees down to the leaves. See
[MP99] for a comprehensive study.

The first strategy (Fig. 3a) is fully pipelined: an STT join is performed as
the left-most node, and a SAI join is executed for the following joins: at each
step a new index entry [MBR, RecordID] is produced. The MBR is the argument
of a window query for the following join. The result is a tuple
$i_1, i_2, \ldots, i_n$ of record ids: the records are then retrieved with
RowAccess iterators, one for each relation, in order to perform the refinement
step on the n-way join but on a limited number of records. The second strategy
(Fig. 3b) uses the Build-and-Match strategy instead of SAI.
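
The fully pipelined strategy can be illustrated with chained generators. This is a sketch under simplifying assumptions: `window_query` is a linear scan standing in for an R-tree window query, the `seed` stream stands for the output of the left-most join, and the names `sai_step` and `pipelined_join` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Entry = Tuple[Rect, int]                  # index entry [MBR, RecordID]
Partial = Tuple[Tuple[int, ...], Rect]    # (record ids so far, last MBR)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def window_query(index: List[Entry], window: Rect) -> Iterator[Entry]:
    """Stand-in for an R-tree window query: a plain linear scan."""
    return (entry for entry in index if intersects(entry[0], window))


def sai_step(upstream: Iterable[Partial], index: List[Entry]) -> Iterator[Partial]:
    """One pipelined join step: each incoming MBR drives a window query."""
    for ids, mbr in upstream:
        for next_mbr, rid in window_query(index, mbr):
            # The new entry's MBR becomes the window of the following join.
            yield ids + (rid,), next_mbr


def pipelined_join(seed: Iterable[Partial], indices: List[List[Entry]]):
    """Chain one SAI step per remaining index; nothing is materialized."""
    stream = seed
    for index in indices:
        stream = sai_step(stream, index)
    return (ids for ids, _ in stream)
```

Each result tuple of record ids is produced as soon as its last window query matches, which is why this plan has no blocking operator.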
a. A left-deep tree with pipelined iterators

b. A left-deep tree with blocking iterators

Fig. 3. Two basic strategies for left-deep QEPs



Evidently, the QEPs shown in Fig. 3 are extreme cases. Depending on the
estimated size of the output(s), a mix of both strategies can be used for a
large number of joins. More importantly, the refinement step can be done prior
to the completion of the query if the candidate set is expected to contain a
large number of false hits. By computing the refinement step in a lazy mode, as
suggested in Fig. 3, the cardinality of intermediate results is larger (because
of false hits) but the size of records is smaller.

We do not consider the case of bushy trees since they involve join algorithms
upon non-indexed relations. As an example of a bushy-tree QEP, consider the
following QEP for the join of 4 relations $R_1$, $R_2$, $R_3$, $R_4$: $R_1$ and
$R_2$ are joined using STT, as are $R_3$ and $R_4$. The two (non-indexed)
intermediate results must then be joined. In the case of $n \le 3$ (only one or
two joins), which is considered here, only left-deep trees can be generated.

Join Strategies 

We describe in this section three variants of the same strategy, called
Build-and-Match (Fig. 3b), which consists in building on the fly an index on a
non-indexed intermediate relation and joining the result with an indexed
relation. When the structure built is an R-tree, the construction is followed
by a regular STT join. The rationale of such an approach is that even though
building the structure is time consuming, the subsequent join is so efficient
that the overall time performance is better than when applying SAI. Of course
the building phase is implemented by a blocking iterator and requires memory
space.

STJ 

The first one is the Seeded Tree Join (STJ) [LR98]. This technique consists in
building from an existing R-tree, used as a seed, a second R-tree called the
seeded R-tree. The motivation behind this approach is that tree matching during
the join phase should be more efficient than if a regular R-tree were
constructed. During the seeding phase, the top k levels of the seed are copied
to become the top k levels of the seeded tree. The entries of the lowest level
are called slots. During the growing phase, the objects of the non-indexed
source are inserted in one of the slots: a rectangle is inserted in the slot
that contains it or needs the least enlargement. Whenever the buffer is full,
all the slots which contain at least one full page are written to temporary
files.
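
The slot selection rule of the growing phase (containment first, least enlargement otherwise) can be sketched as follows. The function name `choose_slot` and the use of area as the enlargement measure are assumptions made for illustration; the source only states the rule informally.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def choose_slot(slots: List[Rect], rect: Rect) -> int:
    """Return the index of the slot that should receive rect."""
    # Prefer a slot whose bounding box already contains the rectangle.
    for i, slot in enumerate(slots):
        if contains(slot, rect):
            return i
    # Otherwise pick the slot whose box grows the least (measured by area).
    return min(
        range(len(slots)),
        key=lambda i: area(union(slots[i], rect)) - area(slots[i]),
    )
```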




Seeding tree / Seeded tree

a. The seeding phase

Fig. 4. Seeded tree construction



When the source has been exhausted, the construction of the tree begins: for 
each slot, the objects inserted in the associated temporary files (as well as the 
objects remaining in the buffer) are loaded to build an R-tree (called a grown 
subtree): the slot entry is then modified to point to the root of this grown subtree. 
Finally a cleanup phase adjusts the bounding boxes of the nodes (Fig. 4), as in
classical R-trees.

The grown subtrees may have different heights: hence the seeded tree is not 
balanced. It can be seen as a forest of relatively small R-trees: one of the expected 
advantages of the method is that the construction of each grown subtree is done 
in memory. 

There is however an important condition to fulfill: the buffer must be large 
enough to provide at least one page to each slot. If this is not the case, the pages 
associated to a slot will be read and written during the growing phase, thus 
rendering the method ineffective. 

STR 

The second Build-And-Match variant implemented, called Sort-Tile-Recursive
(STR), constructs on the fly an STR packed R-tree [LEL97]. We also experimented
with the Hilbert packed R-tree [KF93], but found that the comparison function
(based on the Hilbert values) was more expensive than the centroid comparison
of STR.

The algorithm is as follows. First the rectangles from the source are sorted
by the x-coordinate of their centroid. (The sort is implemented as an iterator
which carries out a sort-merge algorithm according to the design presented in
[Gra93].) At the end of this step, the size N of the dataset is known: this
allows estimating the number of leaf pages as $P = \lceil N/c \rceil$, where c
is the page capacity. The dataset is then partitioned into
$\lceil \sqrt{P} \rceil$ vertical slices. The $\lceil \sqrt{P} \rceil \cdot c$
rectangles of each slice are loaded, sorted by the y-coordinate of their
center, grouped into runs of length c and packed into the R-tree leaves. The
upper levels are then constructed according to the same algorithm. At each
level, the nodes are roughly organized in horizontal or vertical slices
(Fig. 5).
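
The leaf-level packing described above can be sketched in a few lines. This is a minimal in-memory illustration, assuming the whole dataset fits in a Python list (the paper uses an external sort-merge iterator instead); the function name `str_pack_leaves` is hypothetical.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def str_pack_leaves(rects: List[Rect], c: int) -> List[List[Rect]]:
    """Group rectangles into leaves of capacity c, following the STR scheme."""
    n = len(rects)
    if n == 0:
        return []
    p = math.ceil(n / c)               # number of leaf pages, P = ceil(N / c)
    slices = math.ceil(math.sqrt(p))   # number of vertical slices
    per_slice = slices * c             # rectangles per vertical slice

    # Sort by the x-coordinate of the centroid, then cut into vertical slices.
    by_x = sorted(rects, key=lambda r: (r[0] + r[2]) / 2)
    leaves: List[List[Rect]] = []
    for start in range(0, n, per_slice):
        # Inside a slice, sort by the y-coordinate of the centroid.
        vertical = sorted(by_x[start:start + per_slice],
                          key=lambda r: (r[1] + r[3]) / 2)
        # Pack runs of length c into leaves.
        for run in range(0, len(vertical), c):
            leaves.append(vertical[run:run + c])
    return leaves
```

Applying the same procedure to the bounding boxes of the leaves would yield the next level up, as the text indicates.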



a. The sort phase (sort/flush, load/merge, sorted)

Panel labels: leaves, y slice, x slice

Fig. 5. STR tree construction



SaM 

The third Build-And-Match variant, called Sort-and-Match (SaM), is novel. It
uses the STR algorithm but the construction is stopped at the leaf level, and
the pages are not written onto disk. As soon as a leaf $l$ has been produced,
it is joined to the existing R-tree $I_R$: a window query with the bounding box
of $l$ is generated, which retrieves all $I_R$ leaves $l'$ such that $l$.MBR
intersects $l'$.MBR. $l$ and $l'$ are then joined with the plane-sweep
algorithm already used in the STT algorithm.

An interesting feature of this algorithm is that, unlike the previous ones, it 
does not require the entire structure to be built before the matching phase thus 
saving the flushing of this structure onto disk, resulting in much faster response 
time. 
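
The match phase of SaM can be sketched as follows. This is an illustrative sketch only: the existing R-tree is replaced by a plain list of its leaves that is scanned linearly, the plane sweep is a textbook x-sweep rather than the authors' code, and the names `sweep_join` and `sam_join` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Entry = Tuple[Rect, str]                  # (MBR, record id)


def mbr_of(entries: List[Entry]) -> Rect:
    return (min(e[0][0] for e in entries), min(e[0][1] for e in entries),
            max(e[0][2] for e in entries), max(e[0][3] for e in entries))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def sweep_join(left: List[Entry], right: List[Entry]) -> List[Tuple[str, str]]:
    """Plane sweep along x; returns (left id, right id) for intersecting MBRs."""
    left = sorted(left, key=lambda e: e[0][0])
    right = sorted(right, key=lambda e: e[0][0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0][0] <= right[j][0][0]:
            pivot, others, k, flip = left[i], right, j, False
            i += 1
        else:
            pivot, others, k, flip = right[j], left, i, True
            j += 1
        # Scan entries of the other list that start before the pivot ends.
        while k < len(others) and others[k][0][0] <= pivot[0][2]:
            other = others[k]
            if pivot[0][1] <= other[0][3] and other[0][1] <= pivot[0][3]:
                out.append((other[1], pivot[1]) if flip else (pivot[1], other[1]))
            k += 1
    return out


def sam_join(leaves: Iterable[List[Entry]],
             index_leaves: List[List[Entry]]) -> Iterator[Tuple[str, str]]:
    """Join each freshly produced leaf at once; no tree is built or flushed."""
    for leaf in leaves:
        window = mbr_of(leaf)
        for other in index_leaves:  # stand-in for an R-tree window query
            if intersects(window, mbr_of(other)):
                yield from sweep_join(leaf, other)
```

Because `sam_join` consumes its leaves lazily, results can be emitted while the sorted input is still being packed into leaves.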



5 Performance Evaluation 

The machine used throughout the experiments is a SUN SparcStation 5 with
32 MB of memory, running SunOS 4.1. We use in our experiments synthetic
datasets, created with the ENST rectangle generator [GOP+98]. This tool
(available at http://www-inf.enst.fr/~bdtest/sigbench/) generates a set of
rectangles according to a statistical model whose parameters (size, coverage,
distribution) can be specified. The following three statistical models were
used, sharing the same 2D universe (map):

1. Counties (called Biotopes in [GOP+98]) simulates a map of counties:
rectangles have a shape and location uniformly distributed, and the overlap
(ratio between the sum of the areas of the rectangles and the map area) is
100%.

2. Cities [GOP+98] simulates a map of cities: the map contains small rectangles
whose shape is normally distributed (around the square shape) and whose
location is uniformly distributed. The overlap is equal to 5%.

3. Roads simulates a map of roads: rectangle location and shape are uniform as
in Counties but the overlap is 200%.

Fig. 6. Database sample

For each of the statistical models, 5 datasets have been generated, with a
size ranging from 20 000 to 100 000 objects, referred to as DATxx, where DAT
is one of COUN, CIT, ROA and xx ranges from 20 to 100. For example, COUN20
stands for Counties with 20 000 rectangles.

Join strategies are evaluated on the query Cities $\bowtie$ Counties in the
case of single joins and the query Cities $\bowtie$ Counties $\bowtie$ Roads
for two-way joins.

We assume a page size of 4K and a buffer size ranging from 400K (100 pages) 
to 2.8MB (700 pages). The record size is 158 bytes and the buffer policy is LRU. 
Fig. 6 gives some statistics on the generated database (data file and index). Only
the information on Counties is reported. Indeed the sizes do not depend on the 
statistical model, so Cities and Roads have almost identical characteristics. The 
fanout (maximum number of entries/page) of an R*tree node is 169. We give the 
number of entries in the root since it is an important parameter for the seeded 
tree construction. 

The main performance criteria are (i) the number of I/O, i.e., the number of 
calls (page faults) to the I/O module and (ii) the CPU consumption. 

The latter criterion depends on the algorithm. It is measured as the number of
comparisons (when sorting occurs), the number of rectangle intersections (for
joins) or the number of unions (for R-tree construction): see Appendix A. The
parameters chosen are the buffer size, the data set size and, of course, the
variants in the query execution plan and the join algorithms.

Single Join 

When there is a single join in the query and both relations are indexed, a
good candidate strategy is STT. Part of our work below is related to a closer
assessment of this choice. To this end, we investigate the behavior of the
candidate algorithms for single joins, namely SAI and STT. Fig. 7 gives, for
each algorithm, the number of I/Os as well as the number of rectangle
intersection tests (NBI), for a buffer set to 250 pages (1 MB). STT_RA stands
for a QEP where the join is followed by a RowAccess operator, while STT is a
stand-alone join. Indeed, SAI and STT_RA deliver exactly the same result,
namely pairs of [Record, RecordID], while STT only yields pairs of RecordIDs.

As expected, the larger the dataset, the worse SAI performs, both in I/Os and
NBIs. There is a significant overhead when the R*tree size is larger than the
available buffer. This is due to the repeated execution of window queries
with randomly distributed window arguments.

STT outperforms SAI with respect to both I/Os and NBI. But, as explained
above, the comparison to be made is not between SAI and STT but between SAI
and STT_RA. Then, looking at Fig. 7, the number of I/Os is of the same order
for the two algorithms. Furthermore, it is striking that the RowAccess cost is
more than one order of magnitude larger than the cost of the join itself for
STT (e.g., for a dataset size of 100K, there are 104 011 I/Os while the join
phase costs only 1 896 I/Os)!

The RowAccess iterator in the QEP implementing STT_RA reads the pages at
random, so a large number of pages are read more than once. The number
of I/Os depends both on the buffer size and on the record size (here 158
bytes, which is rather low) and can be estimated according to the model in
[Yao77].
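
For reference, the classical block-access estimate of [Yao77] can be computed as below. This is a sketch of the textbook formula only: it gives the expected number of distinct blocks touched when k out of n records are selected without replacement, with records spread evenly over m blocks. It does not model the buffer or repeated reads of the same page, so how exactly the authors applied the model is not something this sketch can claim. The function name `yao_blocks` is hypothetical.

```python
def yao_blocks(n: int, m: int, k: int) -> float:
    """Expected number of distinct blocks hit when k of n records are chosen.

    Assumes n records spread evenly over m blocks (n / m records per block)
    and a selection of k distinct records made uniformly at random.
    """
    per_block = n / m
    if k > n - per_block:
        return float(m)  # too many records selected to miss any block
    # Probability that one given block contains none of the k records.
    untouched = 1.0
    for i in range(k):
        untouched *= (n - per_block - i) / (n - i)
    return m * (1.0 - untouched)
```

For instance, with 4 records in 2 blocks, picking 2 records touches 5/3 blocks on average, since one third of the possible pairs fall in a single block.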

Since STT’s performance (without RowAccess) is not very sensitive to an 
increase in the index size, it should not be very sensitive to a decrease in memory 
space. This justifies that most of the buffer space available should be dedicated 
to the RowAccess iterator in order to reduce its extremely large cost. 
Fig. 7. Left-most node: join Cities-Counties, buffer size = 250 pages



To reduce the number of datafile accesses, we insert in the QEP a SegSort
iterator before the RowAccess. Pages whose ids are loaded in the SegSort buffer
can then be read in order rather than randomly. The efficiency depends on the
size SGB of this buffer.

Fig. 8 displays the number of I/Os versus the data size, for SAI and STT,
for several values of the parameter SGB. The total buffer size is 250 pages,
and is split into a buffer dedicated to SegSort and a 'global' buffer whose
size is 250 - SGB. STT-xx stands for the STT join where SGB=xx. In order to
compare with the results of Fig. 7, we only access one relation. The larger
SGB, the larger the gain. This is due to the robustness of STT performance with
respect to buffer size: its performance is not significantly reduced with a
small dedicated buffer size.




Fig. 8. SegSort experiment



Figure 8 illustrates the gain from sorting the pages to be accessed: for a
large data set size, the gain with STT-200 is almost a factor of 3, compared to
STT_RA.

In conclusion, the combination of STT with a SegSort operator (or any other
means to reduce the cost of random I/Os, for instance spatial data clustering)
outperforms SAI.

We now compare the performance of the three Build-And-Match candidate
algorithms (STJ, STR and SaM). Both the Build and the Match phases are
considered, but we do not account for any FileScan. In other words, as stressed
above, we restrict ourselves to the case where the join is executed on an
intermediate result in which each tuple is produced one at a time.

Figure 9a displays the cost of the three algorithms for 4 data set sizes. The
case of STJ deserves some discussion. Note first that it is very unlikely that
we can copy more than the root of the seeding tree because of the large fanout
(169) of the R*tree. Indeed, in copying the first level also, the number of
slots would largely exceed the buffer size.

In copying only the root, the number of slots may vary between 2 and 169.
Actually, in our database the root is either almost full (dataset size 20K) or
almost empty (dataset size ≥ 40K). See Figure 6.

When the number of slots is large, one obtains a large number of grown R-trees
(one per slot) whose size is small. Then the memory utilization is very low:
an almost empty root with 2 or 3 leaves. (We do not pack the roots of grown
subtrees, as proposed in [LR98]. Packing them renders the data structure and
implementation complex, and has some further impact on the design of the STT.)
If the number of slots is small, then there is a small number of large R-trees,
each of them requiring a significant construction time. In all cases, the CPU
construction cost is high, although the I/O cost is low because each grown
subtree can be built in memory.

Fig. 9. Build and match joins

STR and SaM are extremely efficient with small dataset sizes (40K). Indeed
the construction of the index is entirely done in main memory. Even for large
data sets, SaM is very efficient. Compared to STR, the number of rectangle
intersections is the same, but since the tree is not constructed the number of
I/Os is smaller, and the gap widens as the data set size increases (it is 20%
smaller than for STR with a dataset size greater than 80K).

During the match phase, SaM is also efficient: in fact it can be seen as a
sequence of window queries, with two major improvements: (i) leaves are joined,
and not entries, hence one level is saved during tree traversal, and (ii) more
importantly, two successive leaves are located in the same part of the search
space. Therefore the path in the R-tree is likely to be already loaded in the
buffer.

We now test the robustness of the algorithms' performance with respect to
the buffer size. In Figure 10 we measure the performance of the algorithms by
joining two 100K datasets and letting the buffer size vary from 100 pages
(400K) to 700 pages (2.8 MB). RowAccess is not taken into account. We do not
include the cost of STJ for the smallest buffer size since buffer thrashing
cannot be avoided in that case.




Fig. 10. Join Cities 100K - Counties 100K, varying buffer



Looking at Figure 10, the following remarks are noteworthy: (i) the sort-based
algorithms benefit from large buffers; this is less clear for STJ; (ii) as
expected, STT performance is robust with respect to buffer size; this is
important since algorithms whose memory requirement is known and reasonable in
size allow for more flexibility when assigning memory among several operators,
as shown in the next section. Observe also that when the Build phase can be
performed in memory, the Join phase of SaM outperforms STT; (iii) the larger
the buffer size, the more SaM outperforms the two other Build-And-Match
strategies: while its gain over STR is only 20% for a small buffer size, it
reaches a factor of three for a buffer capacity of 700 pages.

Two Way Joins 

This section relies on the above results for the evaluation of QEPs involving
two joins. In the sequel, the left-most node of the QEP is always an STT
algorithm performed on the two existing R*trees (on Cities and Counties) which
delivers pairs of index entries $[i_1, i_2]$. The name of a join algorithm
denotes the second join algorithm, which takes the result of STT, builds a
structure and performs the join with the index on Roads.

Fig. 11. Two way joins



Note that in that case one does not save a RowAccess with SAI for the
refinement step. Indeed, like the Build-And-Match strategies, SAI takes as
input only an index entry [RecordID, MBR] from the STT join. The result is, in
all cases, a set of triplets of index entries.

The datasets Counties, Cities and Roads have equal size, and a fixed buffer
of 500 pages has been chosen. We run the experiments for the medium size of
40K and the larger size of 100K. The latter two-way join yields 865 473
records, while the former yields 314 617 records.

Figure 11 gives the response time for SAI and the three variants of
Build-And-Match algorithms. Let us look first at SAI performance. For a small
dataset size (40K), the index fits in memory, and only a few I/Os are generated
by the algorithm. However the CPU cost is high because of the large number of
intersection tests. For large dataset sizes, the number of I/Os is huge, which
rules this algorithm out as a candidate.

STJ outperforms SAI for large datasets. But its performance is always much
below that of SaM and STR. The explanation of this discrepancy is the
following. For a 40K size, the first level of the seeding tree could be copied,
resulting in 370 slots. The intermediate result consists of 116 267 entries.
So, there is an average of 314 entries per slot: each subtree includes a root
with an average of two leaves, leading to a very bad space utilization. A large
number of window queries are generated due to the imbalance of the matched
R-tree. In the case of 100K datasets, only 8 slots can be used, and the
intermediate result consists of 288 846 records. Hence we must construct a few
large R-trees, which is very time consuming.

SaM significantly outperforms STR, mostly because it saves the construction 
of the R-tree structure, and also because the join phase is very efficient. It is 
worth noting, finally, that SAI is a good candidate for small dataset sizes,
although its CPU cost is still larger. One should not forget that SAI is, in that 
case, the only fully pipelined QEP. Therefore the response time is very short, a 
parameter which can be essential when the regularity of the data output is more 
important than the overall resource consumption. 
Discussion

By considering complete QEPs, including the I/O operations for the refinement
step, we were able to identify the bottlenecks and the interactions between
the successive parts of a QEP.

The efficiency of the commonly accepted STT algorithm is natural: an index
is a small, structured collection of data, so joining two indices is more
efficient than other strategies involving the data files. The counterpart,
however, is the cost of accessing the records after the join for the refinement
step, which is often ignored in evaluations, although several papers report the
problem (see for instance [PD96] and the recent work of [AGPZ99]). It should be
noted that in pure relational optimization, the manipulation of RecordID lists
has been considered for a long time to be less efficient than the (indexed)
nested loop join [BE77]. Even nowadays, the ORACLE DBMS uses a SAI strategy in
the presence of two indices [Ora]. In the context of spatial databases, though,
SAI incurs a prohibitive cost as soon as the index size is larger than the
buffer and the number of window queries is high. Whenever STT is chosen, we
face the cost of accessing the two relations for the refinement step. When data
is not spatially clustered, the present experiment suggests introducing a
scheduling of row accesses through a specific iterator. We used the algorithm
of [Val87], but other techniques are available. The combination of STT and
SegSort outperforms SAI for large datasets, in part because of the robustness
of STT with respect to the buffer size.

For two-way joins, the same guidelines should apply. Whenever we intend 
to build an index for subsequent matching with an existing R-tree, the build 
algorithm performance should not degrade when there is a shortage of buffer 
space, since most of the available space should be dedicated to the costly access to 
records after the join. We experimented with three such Build-And-Match strategies: a
top-down index construction (STJ), a bottom-up index construction (STR) and 
an intermediate strategy which avoids the full index construction (SaM) . Several 
problems were encountered with STJ, while the classical solutions based on 
sorting appear quite effective. They provide a simple, robust and efficient solution 
to the problem of organizing an intermediate result prior to its matching with an 
existing index. The SaM algorithm was shown to be a very good candidate: it can 
be carried out with reasonably low memory space and provides the best response 
time since its Build phase is not completely blocking: records are produced before 
the build phase is completed. 

6 Conclusion and Future Work 

The contribution of this paper is threefold: (i) provide an evaluation platform
general enough to experimentally evaluate complex plans for processing spatial
queries and to study the impact on performance of design parameters such as
buffer size, (ii) show that in build-and-match strategies for spatial joins it
is not necessary to completely build the index before the join: this resulted
in a join strategy called SaM that was shown in our experiments to outperform
the other known build-and-match strategies, (iii) show that physical operations
that occur in the query execution plan associated with a join strategy have a
large impact on performance. For example, we studied the impact of record
access after the join, which is a very costly operation.

The performance evaluation stressed the importance of memory allocation 
in the optimization of complex QEPs. The allocation of available buffer space
among the (blocking) operators of a QEP, although it has been addressed at
length in a pure relational setting, is still an open problem. We intend to
refine our evaluation by studying the impact of selectivity and relation size
on the memory allocation. Some other parameters such as the data set
distribution or the placement of the record access in the QEP may also have
some impact. The aim is to exhibit a cost model simple enough to be used in an
optimization phase to decide on memory allocation.



References 



[AGPZ99] D. Abel, V. Gaede, R. Power, and X. Zhou. Caching Strategies for
Spatial Joins. Geoinformatica, 1999. To appear.

[APR+98] L. Arge, O. Procopiuc, S. Ramaswami, T. Suel, and J. Vitter. Scalable
Sweeping Based Spatial Join. In Proc. Intl. Conf. on Very Large Data Bases,
1998.

[BE77] M. Blasgen and K. Eswaran. Storage and access in relational databases.
IBM Systems Journal, 1977.

[BKK96] S. Berchtold, D. Keim, and H.-P. Kriegel. The X-tree: An Index
Structure for High-Dimensional Data. In Proc. Intl. Conf. on Very Large Data
Bases, 1996.

[BKS93] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Efficient Processing of
Spatial Joins Using R-Trees. In Proc. ACM SIGMOD Symp. on the Management of
Data, 1993.

[BKSS90] N. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger. The R*tree: An
Efficient and Robust Access Method for Points and Rectangles. In Proc. ACM
SIGMOD Intl. Symp. on the Management of Data, pages 322-331, 1990.

[BKSS94] T. Brinkhoff, H.P. Kriegel, R. Schneider, and B. Seeger. Multi-Step
Processing of Spatial Joins. In Proc. ACM SIGMOD Symp. on the Management of
Data, pages 197-208, 1994.

[BKV98] L. Bouganim, O. Kapitskaia, and P. Valduriez. Memory Adaptative
Scheduling for Large Query Execution. In Proc. Intl. Conf. on Information and
Knowledge Management, 1998.

[GOP+98] O. Günther, V. Oria, P. Picouet, J.-M. Saglio, and M. Scholl.
Benchmarking Spatial Joins A La Carte. In Proc. Intl. Conf. on Scientific and
Statistical Databases, 1998.

[Gra93] G. Graefe. Query evaluation techniques for large databases. ACM
Computing Surveys, 25(2):73-170, 1993.

[GS87] R.H. Güting and W. Schilling. A Practical Divide-and-Conquer Algorithm
for the Rectangle Intersection Problem. Information Sciences, 42:95-112, 1987.




306 Apostolos Papadopoulos, Philippe Rigaux, and Michel Scholl 



[Gun93] O. Günther. Efficient Computation of Spatial Joins. In Proc. IEEE Intl.
Conf. on Data Engineering, pages 50-59, 1993.

[Gut84] A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching.
In Proc. ACM SIGMOD Intl. Symp. on the Management of Data, pages 45-57, 1984.

[HJR97] Y.-W. Huang, N. Jing, and E.A. Rundensteiner. Spatial Joins Using
R-trees: Breadth-first Traversal with Global Optimizations. In Proc. Intl.
Conf. on Very Large Data Bases, 1997.

[KF93] I. Kamel and C. Faloutsos. On Packing R-trees. In Proc. Intl. Conf. on
Information and Knowledge Management (CIKM), 1993.

[KS97] N. Koudas and K. C. Sevcik. Size separation spatial join. In Proc. ACM
SIGMOD Symp. on the Management of Data, 1997.

[LEL97] S. Leutenegger, J. Edgington, and M. Lopez. STR: a Simple and Efficient
Algorithm for R-tree Packing. In Proc. IEEE Intl. Conf. on Data Engineering
(ICDE), 1997.

[LR96] M.-L. Lo and C.V. Ravishankar. Spatial Hash-Joins. In Proc. ACM SIGMOD
Symp. on the Management of Data, pages 247-258, 1996.

[LR98] M.-L. Lo and C.V. Ravishankar. The Design and Implementation of Seeded
Trees: An Efficient Method for Spatial Joins. IEEE Transactions on Knowledge
and Data Engineering, 10(1), 1998. First published in SIGMOD'94.

[MP99] N. Mamoulis and D. Papadias. Integration of spatial join algorithms for
joining multiple inputs. In Proc. ACM SIGMOD Symp. on the Management of Data,
1999.

[ND98] B. Nag and D. J. DeWitt. Memory Allocation Strategies for Complex
Decision Support Queries. In Proc. Intl. Conf. on Information and Knowledge
Management, 1998.

[NHS84] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The Grid File: An
Adaptable Symmetric Multikey File Structure. ACM Transactions on Database
Systems, 9(1):38-71, 1984.

[Ora] Oracle 8 Server Concepts, Chap. 19 (The Optimizer). Oracle Technical
Documentation.

[Ore86] J. A. Orenstein. Spatial Query Processing in an Object-Oriented
Database System. In Proc. ACM SIGMOD Symp. on the Management of Data, pages
326-336, 1986.

[PD96] J.M. Patel and D. J. DeWitt. Partition Based Spatial-Merge Join. In
Proc. ACM SIGMOD Symp. on the Management of Data, pages 259-270, 1996.

[RL85] N. Roussopoulos and D. Leifker. Direct Spatial Search on Pictorial
Databases Using Packed R-Trees. In Proc. ACM SIGMOD Symp. on the Management of
Data, pages 17-26, 1985.

[SRF87] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+Tree: A Dynamic
Index for Multi-Dimensional Objects. In Proc. Intl. Conf. on Very Large Data
Bases (VLDB), pages 507-518, 1987.

[Val87] P. Valduriez. Join Indices. ACM Trans. on Database Systems,
12(2):218-246, 1987.

[VG98] V. Gaede and O. Guenther. Multidimensional Access Methods. ACM Computing
Surveys, 1998. Available at http://www.icsi.berkeley.edu/~oliverg/survey.ps.Z.

[Yao77] S. B. Yao. Approximating Block Accesses in Data Base Organizations.
Communications of the ACM, 20(4), 1977.




A Performance Evaluation of Spatial Join Processing Strategies 307 



Appendix A 

We give below a simple cost model for estimating the response time of an
algorithm (query), which includes both I/Os and CPU time. For the I/O time
calculation, we simply assume that each I/O, i.e., each disk access, has a
fixed cost of 10 msec. Therefore, if $nb_{io}$ denotes the number of I/Os, the
time cost (in seconds) due to the disk is:

$$T_{disk} = nb_{io} \cdot 0.01$$

In order to estimate CPU time, we restrict ourselves to the following
operations: rectangle intersections, rectangle unions and sort comparisons. The
parameters are then: (a) the number of rectangle intersections $nb_{inter}$,
(b) the number of number comparisons $nb_{comp}$ and (c) the number of
rectangle unions $nb_{union}$. Since we consider a two-dimensional address
space (generalizations are straightforward), each test for rectangle
intersection costs four CPU instructions (two comparisons per dimension). Also,
each rectangle union costs four CPU instructions. Finally, each comparison
between two numbers costs one CPU instruction. If $MIPS$ denotes the number of
instructions executed in the CPU per second, then the time for each operation
is calculated as:

$$T_{inter} = \frac{nb_{inter} \cdot 4}{MIPS}$$

$$T_{union} = \frac{nb_{union} \cdot 4}{MIPS}$$

$$T_{comp} = \frac{nb_{comp}}{MIPS}$$

The CPU cost is thus estimated as

$$T_{proc} = T_{inter} + T_{union} + T_{comp}$$

In addition to the above CPU costs, we assume that each read or write
operation contributes a CPU overhead of 5000 CPU instructions for pre- and
post-processing of the page:

$$T_{prep} = \frac{nb_{io} \cdot 5000}{MIPS}$$

The total CPU cost is then

$$T_{CPU} = T_{prep} + T_{proc}$$

The response time of a query is then estimated as:

$$T_{response} = T_{CPU} + T_{disk}$$
Abstract. A geologic map is a 2-dimensional representation of an
interpretation of 3-D phenomena. The work of a geologist consists mainly
in (i) inferring subsurface structures from observed surface phenomena 
and (ii) building abductive models of events and processes that shaped 
them during the geologic past. In order to do this, chains of explanations 
are used to reconstruct the Earth history step-by-step. In this context, 
many interpretations may be associated with a given output. In this pa- 
per, we first present the general contexts of geologic map manipulation 
and design. We then propose a framework for geologic map designers 
which supports multiple interpretations. 



1 Introduction 

A geologic map of a given area is a 2-dimensional representation of accepted 
models of its 3-dimensional subsurface structures. It contains geologic data which 
allow an understanding of the distribution of rocks that make up the crust of 
the Earth as well as the orientation of the structures they contain. It is based on 
a geologist’s model explaining the observed phenomena and the processes that 
shaped them in the geologic past. In the current analog approach, this model is 
recorded in a geologic map with an explanatory booklet that describes the au- 
thor’s conclusions as well as relevant field and other observations (e.g., tectonic 
measurements, drill-hole logs, fossil records). Today, this variety of information 
is handled in a digital and even hypermedia form. This requires that a suitable 
geologic hypermap model be conceived, developed and implemented beforehand. The 
main objective of our project is the design of models and tools well-suited for 
the interaction between users and geologic hypermaps. Geologic hypermaps are 
a family of hyperdocuments A( JIVIDTt] with peculiar requirements due to the 
richness (in terms of semantics and structure) of the information to be stored. In 
these applications, users in general are both endusers (e.g., engineers or geology 
researchers) and designers (map makers). 

This contribution focuses on the handling and representation of geologic 
knowledge within hypermap applications and addresses the needs for a data 
model that supports multiple interpretations from one or many geologists. When 
designing geologic maps, the objectives are twofold: (i) inferring subsurface 
structures from observed surface phenomena and (ii) building abductive models of 
events and processes that shaped them in the geologic past. For this, chains of 
explanation are used to reconstruct the Earth history step by step. 

Basic requirements for handling geologic applications in DBMS environ- 
ments are described in issnHi. Those have been studied within a joint project 
with geologists at the Free University of Berlin. We focus here on the task of 
the map maker. To our knowledge, this is one of the first attempts to define 
tools for geologists. The US Geological Survey [USGS97] is currently defining 
a database to store geologic information in a relational context. A few authors 
(e.g., [Hou94], [BW…], [BBG97]) have also studied the 3-dimensional geometric 
aspects of geologic applications. Note however that, even though our goal is to 
build tools for the next generation of geologic maps (i.e., stored in a database), 
the map creation cannot be fully automated as some knowledge is difficult to 
express in terms of rules: There are no real laws that work at all times in all 
possible situations. Hence some steps still need to be performed manually. 

Building the next generation of tools for geologists is a challenging task. The 
underlying geologic models and all the possible ways of manipulating the in- 
formation make the supporting systems extremely complex. In addition, these 
systems borrow from many disciplines such as geospatial databases of course but 
also sophisticated visualization, simulation models, or artificial intelligence. 

Here we restrict our attention to the reasoning process of geologists. As a 
first step, we present an explanation model for map designers, which is meant to 
be used during the abduction and the deduction processes. The model is based 
on complex explanation patterns (e.g., simulation models, similarities) and 
confidence coefficients. Even though geologic map interpretation has been studied 
thoroughly in the field of pure geology for many years (e.g., [Bly76], […w92]), to 
the best of our knowledge, the mechanism behind geologic map making has not been 
studied by computer scientists. Besides its complexity, a reason for that is the 
current lack of data. At present, not many complete geologic maps are stored in a 
digital form. So far, publishing a geologic map has been a time-consuming task. These 
maps are now in the process of being digitized, but it will take many years before 
several documented versions of a map are available. The ultimate goal we are 
aiming at is the participation in the definition of a digital library of geologic maps 
whose elements may serve as starting points for further geologic map definitions. 

This paper is organized as follows. Section 2 gives examples of geologic map 
manipulation and shows three main categories of users. Objects of interest and 
representative queries are given for each category. In Section 3, we present a rea- 
soning model based on explanations and coefficients of confidence. Finally, Sec- 
tion 4 draws our conclusions, relates our work to other disciplines, and presents 
the future steps of our project. 
2 Geologic Hypermaps Manipulation 

According to the degree of expertise of the enduser, manipulating geologic hy- 
permaps can be understood at many levels of abstraction. Three main categories 
of users can be identified, from the naive user to the application designer. They 
all communicate with the geologic maps database in a different way. While a 
naive user would like to know, for instance, the nature of soils in a given area, a 
more sophisticated user would like to find out why a particular geologic object 
was defined a certain way. An even more sophisticated user would like to access 
knowledge on other areas, for instance for comparison. Note that the most so- 
phisticated users also have the requirements of the less qualified ones. They all 
access the basic data of their level together with metadata, which is understood 
differently for each category. 

In this section, each category of users is studied separately. For each one we 
give the main objects of interest. To illustrate our discourse, examples of queries 
and a description of the tools needed are also given. This leads to a hierarchy of 
tools for manipulating such maps. 

2.1 Traditional Enduser (User Type 1) 

These users want to get straightforward information stored in a geologic database. 
A typical user of this category is an engineer who would like to find out the type 
of the soil in a given area in order to install a water pipe. These endusers need 
to access such basic information as well as metainformation. Metainformation 
has two aspects here. In a geospatial sense, it denotes information regarding 
the origin of data. It is for instance the date and the conditions under which 
some measurements were performed. This metadata is invariant in the three 
categories. The other kind of metadata is a high level description of data in the 
database, as for instance the possible values of a given attribute at a naive user 
level. 



Objects of Interest. The major objects of interest at this level are geospatial, 
textual and multimedia objects (e.g., pictures and videos). These objects, 
denoted HO, are linked together via hyperlinks. A geologic object is a complex 
entity of the real world. It has a description as well as spatial attributes. The 
description may be elaborate, as we will see later. In particular, structured infor- 
mation, possibly of a multimedia type, can be attached to a geologic object. This 
information can be easily accessed in a hypermap system. What the users see 
and interact with are hypermaps in a graphical window. These are manipulated 
in a straightforward manner, through both mouse clicks on cartographic objects 
and a basic user interface that gives access to data values. More information 
on the manipulation and structure of HOs can be found in [Voi98]. 
Definition 2.1 A geologic map GM is a directed weakly connected graph 
$(V_{HO}, E_{HO}, F)$, where 

1. $V_{HO}$ is a set of hyperobjects HO. 

2. $E_{HO}$ is a set of edges. 

3. $F : E_{HO} \rightarrow \mathcal{P}(V_{HO})$ is a labeling function. 
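
The weak connectivity required by Definition 2.1 can be checked by ignoring edge directions and traversing the graph. The Python sketch below is only an illustration: the hyperobject names are hypothetical, and the labeling shown is just one possible function from edges to sets of hyperobjects. 

```python
from collections import defaultdict


def is_weakly_connected(vertices, edges):
    """True if the directed graph is connected once edge directions are ignored."""
    vertices = set(vertices)
    if not vertices:
        return False
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    seen = set()
    stack = [next(iter(vertices))]
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        stack.extend(neighbours[vertex] - seen)
    return seen == vertices


# Hypothetical hyperobjects and hyperlinks.
v_ho = {"map_sheet", "fault_f1", "photo_p1", "report_r1"}
e_ho = {("map_sheet", "fault_f1"), ("fault_f1", "photo_p1"),
        ("report_r1", "fault_f1")}
# One possible labeling: each edge is labeled with a subset of v_ho.
labels = {edge: frozenset(edge) for edge in e_ho}

if __name__ == "__main__":
    print(is_weakly_connected(v_ho, e_ho))
```
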

A prototype for this category of users was coded using ArcView [ESRI96a] 
and its programming language Avenue [ESRI96b], as reported in [KSV98]. In 
addition to basic visualization and querying, features such as flexible attribute 
display and specialized tools for specialists (e.g., a soil scientist) were also im- 
plemented. Figure 1 gives an example of a screenshot of the prototype. 




Fig. 1. Screenshot from the prototype 



Examples of Queries. Queries to be posed to the system may include: 

At a basic data level (instance level) 

o What are the characteristics (alphanumeric description) of this object? 
(point query) 
o What are all geo-objects in this area? (region query) 
o What is the nature of the soil here? 
o Where on this map do I find fossils of type FS22? 
o How much of geologic layer 1 is represented at depth d? 



At a metadata level 

o What are the existing soil classifications in this map? 
o When was this map created? 

2.2 Geology Researcher (User Type 2) 

A geologic map is the basic tool of geologists. Hence these sophisticated users 
need to access basic information as well but also to understand why objects 
relate to each other and thus access the theories that justify the existence or the 
particular aspect (e.g., attribute values or shape) of each geologic object that 
constitutes the map. For instance, they may require explanations behind given 
phenomena. 

Objects of Interest. These users access all objects described above but also 
another type of hyperobjects. As we will see in the next section, some geologic 
objects are for sure part of a map (with probability 1), and other objects are 
“guesstimated” by the map maker (with 0 < existence-probability < 1). Type 2 
users can access these two categories of objects as well as the complex structures 
behind them. 

Examples of Queries. The interesting queries in this category are those re- 
lated to object existence as well as assumptions or explanations on objects. The 
list is quite long as we want to show many possible manipulations of the struc- 
ture defined in the next section. 

At a basic data level 

o What are the geologic objects related to the existence of this one? 
o What are all the explanations of Ms. XYZ in this area? 
o What assumptions led to the definition of this object (if any)? 
o Which objects are not based on assumptions? 
o Which objects are defined exclusively under assumptions? 
o Which objects are defined using this explanation? 
o What is the explanation of the genesis of this layer? 

At a metadata level 

o What are the maps designed after October 1990? 
o What are all the geologic maps using Explanation E? 
o What are the names of the geologists who studied this area? 
o How many versions do I have for geologic map GM231? 
o Where should we install drill-holes in this map? 
2.3 Designer (User Type 3) 

One peculiarity of this level is that, for a given map (i.e., covering a given area), 
many versions based on different interpretations can be defined, as described 
below. 

The map making process can be described by two basic mechanisms, abduc- 
tion and deduction. The geologist starts with a map of a given area. Currently, 
this map is often a topographic map. From such a map, s/he first looks at 
broad topographic features such as valleys and summits. Relationships between 
geologic borders and topography are then inferred. With knowledge of geologic 
structures at the surface (from drill-holes, for instance), the geologist also infers 
other geologic structures and groups them together. Groups are obtained by in- 
terpolation at the surface but also at the subsurface. Besides, s/he associates 
explanations or justification of the presence of certain geologic features together 
with a coefficient of confidence (abduction mechanism). 

Geologic map making is an iterative process. The map maker draws a first 
version of a map based of course on observed features but also on his/her in- 
terpretation of the whole scene. Then s/he verifies the hypothesis by running 
simulation models and going to the field. A new version is then inferred, and so 
on and so forth. Eventually, s/he will obtain a map that corresponds to his/her 
interpretation but which is not “frozen” as things may still evolve in the future 
(e.g., with new explanation or new observed facts). However, the delta between 
his/her successive versions usually tends to become smaller and smaller. What 
we want to provide the designer with, beside assistance in this process, is a 
possibility of storing many underlying theories. Thus many explanations can be 
associated with one object, with different coefficients of confidence. 

Explanations can be combined to form chains of explanations. Different com- 
binations of explanations lead to many interpretations for the same output. A 
given interpretation of a geologic region together with all its interpreted geologic 
features is a version of a geologic map (in the database sense) for that region. 
Hence there are two ways to create new versions of a geologic map GM in the 
database: 

1. Various explanations can be associated with one geologic object o (o ∈ GM) 
by one or many geologists. 

2. Various definitions (identification and attribute values assignments) of a col- 
lection of geologic objects within the same region can be given by one or 
many geologists. 



In addition, the novelty of the approach that underpins our new generation 
of tools for map makers is the ability for them to start with existing geologic 
maps. Then those will be transformed and customized according to different 
interpretations. Hence there is no longer any need to create a geologic map from 
scratch. 

Objects of Interest. The objects manipulated by these users are geologic maps 
composed of documented geologic features (in our model, we prefer the term 
geologic feature to geologic object in order to stay closer to the geologist’s 
reasoning, although both terms are used interchangeably in the paper). Geologic 
features are simple or complex as defined below: 

geologic-feature = (AttributeList, SpatialExtent, UndergroundExtent) 
                       /* atomic geologic feature */ 
                 | (AttributeList, SpatialExtent, UndergroundExtent, 
                    {geologic-feature}) /* complex geologic feature */ 

Complex geologic features are for instance stratigraphic or tectonic structures. 
Note that some values of a complex geologic feature (e.g., SpatialExtent and 
UndergroundExtent) can be inferred from those of subobjects ftT'SUj . 

There is a clear distinction between observed objects, hence existing for sure 
in a map (probability 1) and objects that are “guesstimated”. In the sequel we 
refer to observed objects as “hard objects” as opposed to “soft objects” . A soft 
object can become a hard object when a hypothesis is verified, while a hard 
object cannot turn into a soft object. The probability attached to the existence 
of a soft object can change according to a different interpretation. 

Note that within a given map, some external events can change the nature 
of geologic objects, for instance the introduction of a drill-hole introduces a 
new geologic object in the map. Associated with these objects are explanations 
together with coefficients of confidence. An explanation can justify the presence 
of an object or document the value of an attribute (e.g., the soil concentration 
of a component). 

Objects Manipulation and Querying. Obviously, querying does not play a 
crucial role for this category of users. What is important is to provide designers 
with tools for defining and manipulating the underlying structure (see Section 3). 
However below are a few examples of queries and structure manipulation. As we 
can see from the selected collection of queries, many concerns are of interest for 
the designer, from the representation of a given attribute to the number of maps 
defined in a given area. 

At a basic data level 

o What are the interpretations using Geologic Model M? 
o What if assumptions of Ms. XYZ are not justified anymore? 
o What if model M turns out to be inapplicable in this area? 
o I know there should be cobalt here. Where should it be (shape/location)? 

At a metadata level 

o What geologic model could I use to justify this tectonic structure? 
o What maps were defined with geologic map GM324 as a starting point? 
o How should I represent a layer with iron? 
o What are the maps published in area a? 

o What is the difference between interpretation I1 and interpretation I2? 

3 Supporting Geologic Map Making 

This section presents the kernel of a tool that assists map designers in the geo- 
logic map making process. We place ourselves in the context of a geologic map 
factory that communicates with a geologic map library. Eventually, objects de- 
fined in the map factory will be validated and stored in the geologic map library. 
Such a library is extremely useful in this context as it allows in particular geo- 
logic maps to be built with other maps as starting points. Browsing the library 
is hence a key functionality of the environment offered to geologic map makers. 
The generic geologic map library is not presented here. It is a special kind of 
geospatial library that supports multiversioning based on various interpretations. 

A geologic map factory contains many modules, among them a reasoning 
module. Components such as cartographic module, help module, or validation 
module are not our focus here. Rather we study the major task of the reasoning 
module, namely the support for abduction and deduction in the map making 
process. 

3.1 Reasoning Model 

In order to assist the designer in the map making process, the main object 
to consider is a reasoning structure, i.e., a documented geologic map (DGM). 
All the elements of a DGM, i.e., geological features (atomic or complex) and 
explanations are described hereafter, first in general terms and then using an 
O2-like [BDK92] specification language. For the sake of simplicity and legibility, 
the specifications we give are kept as simple as possible. 



Geologic Features. The basic elements to consider are geologic features. These 
can be atomic (e.g., a fault) or complex (e.g., a tectonic structure). In any case 
they have a description (alphanumeric attributes), a spatial extension (spatial 
part that gives the shape and the location) and an underground extent. A geologic 
feature is of one of the two following types: 

o Hard type. Such objects are part of the map with probability one. The feature 
was for instance seen in the field and the numeric values of its attributes 
could be computed. 
o Soft type. These objects belong to the map with a probability (confidence) 
between 0 and 1. We will see further how they are created. 



Many kinds of relationships exist among geologic features. Besides composition, 
topological relationships such as adjacency are of course of prime importance. 
These relationships are beyond the scope of our study. In addition, 
fuzziness plays a crucial role in these applications as geological features can 
have (i) a fuzzy description (e.g., concentration of cobalt between 70 and 100%), 
(ii) a fuzzy spatial part (location not precise, shape with fuzzy borders), and 
(iii) a fuzzy underground extent. 

Schema definition: 

Class GeologicFeature type tuple 
    (description: AttributeList, 
     spatialextent: Spatial, 
     undergroundextent: Solid) 

Class SimpleGeologicFeature inherits GeologicFeature 
Class ComplexGeologicFeature inherits GeologicFeature 
    type tuple (geolfeatures: {GeologicFeature}) 

Class HardGF inherits GeologicFeature 
Class SoftGF inherits GeologicFeature 
    type tuple (origin: Text) 



Classes AttributeList , Spatial, Solid are not detailed here. Spatial 
and Solid embody both geometry (coordinate location and shape) and topology. 
For a possible definition of Spatial (basically points, lines, curves, polygons and 
regions) in a database context, see [FF89], [SV92], [Wor94], [CS95], [Sch97]. Solid 
(volume) modeling (3D-geometric modeling) appears typically in CAD/CAM 
applications (see for instance IBlNfibl IUH97I '). 
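
To make the schema more concrete, the following Python sketch mirrors it under simplifying assumptions. Instead of separate HardGF and SoftGF subclasses it uses a single class with a `probability` field, and AttributeList, Spatial and Solid are replaced by plain placeholders. The helper `soft_feature` and the method `confirm` are hypothetical names introduced here for illustration. 

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class GeologicFeature:
    description: Dict[str, Any]                 # stands in for AttributeList
    spatial_extent: Any = None                  # stands in for Spatial
    underground_extent: Any = None              # stands in for Solid
    components: List["GeologicFeature"] = field(default_factory=list)
    probability: float = 1.0                    # 1.0 = hard, 0 < p < 1 = soft
    origin: Optional[str] = None                # how a soft feature was guessed

    @property
    def is_complex(self) -> bool:
        return bool(self.components)

    @property
    def is_hard(self) -> bool:
        return self.probability == 1.0

    def confirm(self) -> None:
        """Turn a soft feature into a hard one; no reverse method is offered."""
        self.probability = 1.0


def soft_feature(description, probability, origin):
    """Build a soft feature, rejecting probabilities outside (0, 1)."""
    if not 0.0 < probability < 1.0:
        raise ValueError("a soft feature needs 0 < probability < 1")
    return GeologicFeature(description, probability=probability, origin=origin)


if __name__ == "__main__":
    fault = GeologicFeature({"kind": "fault"})
    layer = soft_feature({"kind": "layer"}, 0.6, "interpolated between drill-holes")
    unit = GeologicFeature({"kind": "tectonic structure"}, components=[fault, layer])
    print(unit.is_complex, layer.is_hard)
```

Keeping the probability on the feature itself makes the one-way transition from soft to hard a simple state change, whereas the class hierarchy of the schema would require replacing the object. 
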



Explanations. We are interested here in the process of documenting such maps, 
i.e., in the possible collections of explanations to justify: 

1. The existence of a geologic feature. 

2. The particular values of some attributes of a geologic feature. 

3. The presence of other explanations (for instance when an explanation 
contains references, typically when a bibliographic reference is given in an 
explanatory text). 



In this context, an explanation can be of three different types: 

1. Provable (reliable). It can be justified by a hard fact, such as a drill-hole, or 
a geologic simulation model. Note the hierarchy in the reliability. 

2. Similarity-based (training areas). This occurs when some part of a map seems 
to be similar to a part of either the same map or of another map. 

3. Experience-based (or feeling). Such explanations can also mutate to become 
provable if an underlying assumption is verified. 



An explanation is a complex object composed of structured text, and possibly 
(bibliographic) references, references to simulation models, geologic features in 
the map, and other areas (coordinates) from either the same or a different map. 
Moreover, it contains the geologic features that serve as justification for the ar- 
gumentation as well as the geologic features that could be further consequences 
of this explanation through the deduction mechanism. Such arguments could be 
defined as query expressions over a geologic map (set of geologic features). The 
simple specification of an explanation is given below. Note the basic superclass 
Explanation which is further specialized into various explanation classes such 
as ProvableExplanation, SimilarityExplanation, and ExpertExplanation. 
In addition, an explanation is either basic or complex, which leads to classes 
BasicExplanation and ComplexExplanation. 

Schema definition: 

Class Explanation type tuple 
    (author: string, 
     argument: {HardGF}, 
     consequence: {SoftGF}) 

Class BasicExplanation inherits Explanation () 

Class ComplexExplanation inherits Explanation 
    type tuple (all-explanations: {Explanation}) 

Class ProvableExplanation inherits Explanation 
    /* e.g., models or drill-holes */ 

Class SimilarityExplanation inherits Explanation 
    /* similarity-based explanation */ 

Class ExpertExplanation inherits Explanation 
    /* experience-based explanation */ 

Class BasicExplanation type tuple (text: string) 

Class ModelExplanation inherits BasicExplanation 
    type tuple (argument: string) 
    /* argument = modelref */ 

Class BiblioExplanation inherits BasicExplanation 
    type tuple (argument: bibitem) 
    /* argument = bibliographic reference */ 

Class HardObjectExplanation inherits BasicExplanation 
    type tuple (argument: {GeologicFeature}) 
    /* argument = geologic features */ 

Class SoftObjectExplanation inherits BasicExplanation 
    type tuple (argument: {GeologicFeature}) 
    /* argument based on geologic features */ 

Class AreaExplanation inherits BasicExplanation 
    type tuple (area: zone) 
    /* argument: Area, the coordinate of a region */ 

Environmental processes that lead to a given geologic feature are found 
by looking recursively at all the explanations of type ModelExplanation. 
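
The recursive search for model-based explanations can be sketched as follows. This Python fragment is an illustration only: it keeps just three of the explanation classes, the field `model_ref` stands for the model reference carried by a ModelExplanation, and the function name `model_refs` is ours. 

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Explanation:
    author: str
    text: str = ""


@dataclass
class ModelExplanation(Explanation):
    model_ref: str = ""                  # reference to a simulation model


@dataclass
class ComplexExplanation(Explanation):
    all_explanations: List[Explanation] = field(default_factory=list)


def model_refs(explanation):
    """Collect, depth first, the model references reachable from an explanation."""
    if isinstance(explanation, ModelExplanation):
        return [explanation.model_ref]
    if isinstance(explanation, ComplexExplanation):
        refs = []
        for sub in explanation.all_explanations:
            refs.extend(model_refs(sub))
        return refs
    return []


if __name__ == "__main__":
    e = ComplexExplanation("Ms. XYZ", "folding, then erosion", [
        ModelExplanation("Ms. XYZ", "fold simulation", "fold_model"),
        Explanation("Ms. XYZ", "field experience"),
        ComplexExplanation("Ms. XYZ", "", [
            ModelExplanation("Ms. XYZ", "", "erosion_model"),
        ]),
    ])
    print(model_refs(e))
```
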



Complete and Documented Geologic Map (DGM). A complete geologic map is a 
5-tuple $(r, d, c, l, dgm)$, where $r$ is the reference of the map in 
the map library, $d$ the date of creation, $c$ the coordinates of the covered area 
($dom(c) = \Re^2 \times \Re^2$), $l$ the legend used and $dgm$ a documented 
geologic map defined as follows. 

A documented geologic map DGM is a set of directed acyclic weakly connected 
labeled graphs with the following characteristics: The source (entry) of 
each graph is a geologic feature (hard or soft). The explanation structure EG 
attached to it with a certain coefficient of confidence is defined below. 

Definition 3.2. An explanation graph EG is a pair (E, P), where 

- E is a set of explanations (inner nodes and drains of the graph). An 
explanation is atomic or complex, as described above. 

- P is a set of weighted edges. There is a weighted edge $we$ with weight 
$p_{n_i,n_{i+1}}$ from a node (including the source) $n_i$ to a node $n_{i+1}$ if 
$n_{i+1}$ is an explanation of $n_i$ with confidence $p_{n_i,n_{i+1}}$ 
($0 < p_{n_i,n_{i+1}} < 1$). The semantics of an edge $we(p_{n_i,n_{i+1}})$ 
between $n_i$ and $n_{i+1}$ corresponds to “$n_{i+1}$ is an explanation of 
$n_i$ with confidence $p_{n_i,n_{i+1}}$”. 



Figure 2 shows a documented geologic map composed of geologic features 
{GF1, ..., GFn}. In this figure, source objects of the form $GF_i$ correspond to 
hard geologic features whereas objects of the form $gf_i$ correspond to soft geologic 
features. As we can see, GF1 can be explained by E1 with confidence $p_{GF1,E1}$ 
or by E2 with confidence $p_{GF1,E2}$. In turn, E2 is explained by E3 with 
confidence $p_{E2,E3}$, which may itself be justified by a complex explanation 
containing both E4 and E5 (with confidence coefficient $p_{E3,(E4,E5)}$) or by E6 
(confidence coefficient $p_{E3,E6}$). The symbol □ denotes drains of the graphs 
(“terminal nodes”). 

Fig. 2. A documented geologic map 



Schema definition: 

Class CompleteGeologicMap type tuple 
    (reference: string, 
     date: date, 
     legend: Legend, 
     explgeomap: DocumentedGeologicMap) 

Class DocumentedGeologicMap type tuple 
    (explhardGF: {ExplHardGF}, 
     explsoftGF: {ExplSoftGF}) 

Method create-DGM (author: string) on class DocumentedGeologicMap 
    /* Method create-DGM creates a map dynamically */ 

Class ExplHardGeologicFeatures type tuple 
    (hardGF: HardGF, 
     explanation-graph: ExplanationGraph) 

Class ExplSoftGeologicFeatures type tuple 
    (softgeolfeature: SoftGF, 
     explanation-graph: ExplanationGraph) 

Class ExplanationGraph type set (graph-element: WeightedExplanation) 

Class WeightedExplanation type tuple (pi: integer) 

Class SimpleWE inherits WeightedExplanation 
    type tuple (expl: Explanation) 
    /* the explanation itself can be complex cf. (E4,E5) in Figure 3 */ 

Class ComplexWE inherits WeightedExplanation 
    type tuple (expl: ExplanationGraph) 

The useful substructures of this documented geologic map are sets of paths 
associated with geologic features. Eventually, after the map making step, many 
possible chains of explanations are associated with a geologic feature (union of 
all the possible paths starting with this object as a source). 

3.2 Specifying Abduction and Deduction Processes 

Assuming the map maker starts with a collection of hard geologic features, s/he 
first associates explanations with the geologic features and then builds the 
structure recursively. The map defined by method create-DGM below is then linked 
to a complete geologic map (with area coordinates, reference, date, etc.). 

The abduction step is described by Procedure create-DGM. Once this is finished 
(i.e., explanations defined with coefficients of confidence), method infer-all 
is invoked on the resulting structure (deductive mechanism). Method create-DGM 
can be called many times in order to assign many explanations to an object. 



Building a DGM. The following method generalizes the abduction process. It 
creates an explanation chain associated with each geologic feature of a map. 

Method create-DGM (author: string) on class DocumentedGeologicMap 
    for each GF in + (self.explgeomap.explhardGF, self.explgeomap.explsoftGF) 
                                                          /* union set */ 
        add-explanation-chain(0) 
            @GF.explanation-graph.graph-element.explanation 
    /* invokes method add-explanation-chain on all geologic features */ 

Method add-explanation-chain (done: Boolean) on class Explanation 
    {if (not done) 
        e = new(Explanation) 
        for each ge in self.explanation 
            e.explanation = add-explanation@ge.explanation 
        Print ('Done?') 
        add-explanation-chain(done)@e.explanation 
    } 

Method add-explanation on class Explanation 
    /* This method creates one weighted explanation. 
       It can be called separately in other contexts (e.g., change 
       of explanation). 
       The coefficient of confidence and the contents of the explanation 
       are entered dynamically. * denotes a call for a function 
       interacting with the user (input parameter) */ 
    {e = new(Explanation) 
     write ('Value for p?') read(p) 
     e.pi = p 
     e.text = *textinput 
     e.argument = *variableinputs 
    } 

The deduction step that follows is realized by: 

Method infer-all on Class DGM 
    /* This method goes recursively through an interpretation. 
       It runs on all explanations and infers all possible consequent 
       objects (Explanation.consequence) by propagating all chosen 
       explanations */ 
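
Since infer-all is only described by a comment, the following Python sketch shows one possible reading of it, not the actual method. Explanations are represented as dictionaries, and the keys `consequence` and `all_explanations` are assumptions that follow the attribute names of the schema. 

```python
def infer_all(chosen_explanations):
    """Return every soft feature named as a consequence of the chosen explanations.

    Each explanation is a dict; "consequence" lists soft-feature names and
    "all_explanations" lists the sub-explanations of a complex explanation.
    """
    inferred = set()
    visited = set()
    stack = list(chosen_explanations)
    while stack:
        explanation = stack.pop()
        if id(explanation) in visited:      # shared sub-explanations are visited once
            continue
        visited.add(id(explanation))
        inferred.update(explanation.get("consequence", []))
        stack.extend(explanation.get("all_explanations", []))
    return inferred


if __name__ == "__main__":
    e4 = {"name": "E4", "consequence": ["sgf2"]}
    e5 = {"name": "E5", "consequence": ["sgf2", "sgf3"]}
    e3 = {"name": "E3", "consequence": ["sgf1"], "all_explanations": [e4, e5]}
    print(sorted(infer_all([e3])))
```
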



3.3 Operations on DGM 

In addition to classical graph manipulation, operations on this structure include: 

o Change a coefficient of confidence. 
  method change-confidence (pi: integer) 
  on class WeightedExplanation 
o Change an explanation. 
  method change-explanation (Ei: Explanation) on class Explanation. 
  In this method, the explanation receiving the method has to be looked for 
  everywhere in the DGM and replaced by Ei. 
o Change the type of an explanation from similarity- or experience-based to 
  provable. This is done by removing an explanation and inserting a new one 
  of type provable. 
o Add a soft object (source) with explanations. 



Queries on this structure are for instance: 

o Retrieve all interpretations of documented map DM that use explanation E1. 

o Retrieve all interpretations of documented map DM with a threshold of 0.5 
in the definition of all its geologic features. 

Other operations concern a change of definition of soft geologic objects, and 
more precisely of the structuring of complex objects. Suppose that two layers 
$l_i$ and $l_{i+1}$ were assigned to the same stratigraphic unit (for instance 
because they contain the same kind of rocks). Suppose also that this unit was 
inferred to be an anticline, which means in particular that 
age($l_i$) < age($l_{i+1}$). Assume that a new layer $l'$ in between is discovered 
with age($l'$) < age($l_i$). This shows a contradiction and the two original layers 
should be part of different stratigraphic units. 
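
Such a contradiction can be detected automatically. The Python sketch below assumes that, within one stratigraphic unit, ages must increase strictly along the order in which the layers are listed. The function name, the layer names and the age values are hypothetical. 

```python
def age_conflicts(layers):
    """Return adjacent layer pairs whose ages are not strictly increasing.

    layers is a list of (name, age) pairs given in stacking order
    within one stratigraphic unit.
    """
    return [
        (lower, upper)
        for (lower, age_lower), (upper, age_upper) in zip(layers, layers[1:])
        if not age_lower < age_upper
    ]


if __name__ == "__main__":
    unit = [("l_i", 100), ("l_i+1", 120)]
    print(age_conflicts(unit))                    # no conflict in the original unit
    revised = [("l_i", 100), ("l'", 90), ("l_i+1", 120)]
    print(age_conflicts(revised))                 # the new layer breaks the ordering
```

A non-empty result signals that the layers can no longer be kept in the same stratigraphic unit under this assumption. 
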



3.4 Support for Multiple Interpretations 

Multiple interpretations can be defined by extracting paths in the structure 
defined above. Thus a given set of objects can be associated with various in- 
terpretations. A more general concept is the one of version which is associated 
with a complete geologic map (see Section 3.2). It hence allows the definition 
and structuring of the original set of geologic objects to be customized. The 
definitions are given below. 

Definition 3.5.1 An explanation chain EC is a 3-tuple $(GF, P, p)$ where GF is 
a geologic object, P a path ($P \subset EG$, where EG is an explanation graph) and 
p a coefficient of confidence. p is defined as the product of the coefficients of 
confidence of all edges of the chain. In Figure 2, explanation E1 is associated to 
GF1 with confidence $p_{GF1,E1}$. This path is likely to be part of a relevant 
version for the user compared to the second explanation for GF1 (explanation E2), 
which has for confidence ($p_{GF1,E2} \times p_{E2,E3} \times p_{E3,(E4,E5)}$) or 
($p_{GF1,E2} \times p_{E2,E3} \times p_{E3,E6}$). 

Definition 3.5.2 An interpretation $I_{GM}$ of a geologic map GM is a set of
explanation chains.

Definition 3.5.3 A weighted interpretation $WI_{GM}$ of a geologic map GM is
a set of pairs (EC, p) where EC is an explanation chain and p a coefficient of
confidence.

Definition 3.5.4 A version $V_{CGM,I_G}$ of a complete geologic map CGM is a pair
(CGM, $I_G$) where CGM is a complete geologic map composed in particular of
the set of geologic features G, and $I_G$ a corresponding interpretation over this set.
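The definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `ExplanationChain`, the helper `best_chain`, and all confidence values are hypothetical and chosen only for illustration. The sketch assumes that a chain stores one coefficient per edge and that its confidence is the product of these coefficients, as stated in Definition 3.5.1.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ExplanationChain:
    """Hypothetical representation of an explanation chain (GF, P, p)."""
    feature: str             # the geologic object GF
    path: tuple              # explanations along the path P, e.g. ("E2", "E3", "E6")
    edge_confidences: tuple  # one coefficient of confidence per edge of the chain

    @property
    def confidence(self) -> float:
        # p is the product of the coefficients of all edges of the chain
        return prod(self.edge_confidences)


def best_chain(chains):
    """Return the chain with the highest overall confidence."""
    return max(chains, key=lambda c: c.confidence)


# Made-up coefficients: a direct explanation versus a longer chain for GF1.
direct = ExplanationChain("GF1", ("E1",), (0.9,))
indirect = ExplanationChain("GF1", ("E2", "E3", "E6"), (0.9, 0.8, 0.5))

# An interpretation (Definition 3.5.2) is simply a set of explanation chains.
interpretation = {best_chain([direct, indirect])}
```

Because the coefficients lie between 0 and 1, a longer chain can never be more confident than its weakest edge, which is why the short path is the more plausible candidate for a relevant version in this toy example.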

Figure 0 shows a documented geologic map M1 with two different interpretations,
I1 and I2. The first interpretation at the top uses explanation chain (E2,
E3, E6) for GF1, (E7, E6) for GF2 and (E9, E10) for GF3. The interpretation
$I1_{\{GF1,GF2,GF3\}}$ is then

$$I1_{\{GF1,GF2,GF3\}} = \{(E2, E3, E6), (E7, E6), (E9, E10)\}.$$

At the bottom of Figure 0 the second interpretation also uses explanation
chain (E2, E3, E6) attached to GF1, but different explanation chains are
associated with objects GF2 (chain (E3, (E4, E5))) and GF3 (chain (E14, E15)).
Hence $I2_{\{GF1,GF2,GF3\}} = \{(E2, E3, E6), (E3, (E4, E5)), (E14, E15)\}$.

Check for Consistency within an Interpretation. Consistency can be
checked within an interpretation once all objects are frozen in the world of 
geologic features and explanations for a given version. Consistency will ensure 
that (i) relationships among geologic structures are not violated, (ii) explana- 
tions do not contradict each other. Defining consistency among explanations of 
one documented map is a complex problem. However, it is handled easily when 
justifications (edges in the graph) are based on the negation of explanations. 
For instance, suppose that E1 explains GF1 and that one soft object sgf2 is
also supposed to belong to the map (e.g., as a consequent object) based on the
negation of E1.



4 Conclusion 

In this paper, we set a framework to manipulate geologic maps at different levels 
of abstraction. We identified three categories of users, from naive to expert, and 
we started with a description of their needs when communicating with a geologic 
hypermap system. The most sophisticated class of users, the designers, clearly 
need tools to define new geologic maps. 

We then explained the mechanism behind the map making process. To sum 
up, the geologist starts with a map (e.g., a topographic map) of a given area. 
This map contains observed surface phenomena. From this map, complex ge- 
ologic structures are extracted. This is an elaborate mechanism that leads to 
(i) inferences of subsurface structures, and (ii) abductive models of events and 
processes that shaped the geologic features. Explanations or a justification of 
the presence of certain geologic features are associated with them together with 
a coefficient of confidence. They may be verified or not, and other conclusions 
may be drawn (iterative process). In this paper, we focused on the definition 
and manipulation of a hypergraph of such explanations. Note that such gen- 
eral structures and mechanisms can be applied to other domains of application, 
and are used for instance in artificial intelligence. The abduction and deduction 
mechanisms have been used for years in medical diagnosis, route planning or 
equipment repair, among others. More recently, these techniques have started to be
applied by the genome community. Thus the idea behind the explanation structure
(documented geological map) proposed here is not new. 

What is innovative, however, is to place this work in the context of geologic
map design and manipulation. So far, geologic map making (which leads to analog
maps) has been a tedious task. Within a computerized environment, it can be greatly
simplified, although many steps will still have to be performed manually, as full
automation based on a complete knowledge-based system seems to be bound to
fail after a few attempts. This simplification is due mainly to (i) assistance
to the geologists in creating new objects (based on map
consultation, model verification, etc.), and (ii) the possible customization of a map
through the reuse of another version based on different interpretations. Both
aspects were shown in the paper. One of the future directions of our work is to
specify the versioning process (i.e., definition and storage of complete geologic 
map versions) in a database environment. 

A geologic hypermap prototype for user types 1 and 2 (as defined in the first 
section) was coded using ArcView. We plan to implement a prototype based on
the explanation model described in Section 3, and to link it eventually to our 
hypermap prototype. This prototype will eventually be linked to a geologic map 
library that is currently under specification. 

This work was done within the framework of a joint-project with geologists. 
Understanding the complex requirements of a foreign area of applications is a 
challenging task. Defining the appropriate tools for the next generation of geolo- 
gists who will use only elaborate digitized map versions is certainly an ambitious 
project that will take a long time. We believe, however, that several problems can 
be isolated and studied step-by-step. The purpose of the work presented here is 
to define a brick necessary to the realization of such a large and complex system. 
Abstract. In many geographical applications there is a need to
model spatial phenomena not simply by sharply bounded objects but 
rather through vague concepts due to indeterminate boundaries. Spatial 
database systems and geographical information systems are currently 
not able to deal with this kind of data. In order to support these appli- 
cations, for an important kind of vagueness called fuzziness, we propose 
an abstract, conceptual model of so-called fuzzy spatial data types (i.e., 
a fuzzy spatial algebra) introducing fuzzy points, fuzzy lines, and fuzzy 
regions. This paper focuses on defining their structure and semantics.
The formal framework is based on fuzzy set theory and fuzzy topology. 



1 Introduction 

Representing, storing, querying, and manipulating spatial information is important
for many non-standard database applications. Specialized systems like geographical
information systems (GIS) and spatial database systems to a certain
extent provide the needed technology to support these applications. So far, spa- 
tial data modeling has implicitly assumed that the extent and hence the borders 
of spatial phenomena are precisely determined, homogeneous, and universally 
recognized. From this perspective, spatial phenomena are typically represented 
by sharply described points (with exactly known coordinates), lines (linking a 
series of exactly known points), and regions (bounded by exactly defined lines 
which are called boundaries). Special data types called spatial data types (see 
FeinTi for a survey) have been designed for modeling these spatial data. We 
speak of spatial objects as instances of these data types. The properties of the 
space at the points, along the lines, or within the regions are given by attributes 
whose values are assumed to be constant over the total extent of the objects. Well 
known examples are especially man-made spatial objects representing engineered 
artifacts like highways, houses, or bridges and some predominantly immaterial 
spatial objects exerting social control like countries, districts, and land parcels 
with their political, administrative, and cadastral boundaries. We will denote
this kind of entities as crisp or determinate spatial objects. 

Increasingly, researchers are beginning to realize that the current mapping of 
spatial phenomena of the real world to exclusively crisp spatial objects is an in- 
sufficient abstraction process for many spatial applications and that the feature 
of spatial vagueness or spatial indeterminacy is inherent to many geographic data 
EFMl . Moreover, there is a general consensus that applications based on this 
kind of indeterminate spatial data are not covered by current GIS and spatial 
database systems. In this paper we focus on a special kind of spatial vague- 
ness called fuzziness. Fuzziness captures the property of many spatial objects 
in reality which do not have sharp boundaries or whose boundaries cannot be 
precisely determined. Examples are natural, social, or cultural phenomena like 
land features with continuously changing properties (such as population density, 
soil quality, vegetation, pollution, temperature, air pressure), oceans, deserts, 
English speaking areas, or mountains and valleys. The transition between a val- 
ley and a mountain usually cannot be exactly ascertained so that the two spatial 
objects “valley” and “mountain” cannot be precisely separated and defined in a 
crisp way. We will designate this kind of entities as fuzzy spatial objects. 

The goal of this paper is to present a formal object model for fuzzy points, 
fuzzy lines, and fuzzy regions in two-dimensional Euclidean space, an effort which 
is to lead to a fuzzy spatial algebra. We propose fuzzy set theory and fuzzy topology 
as appropriate conceptual tools for modeling indeterminate spatial data. Fuzzy 
set theory is an extension and generalization of classical set theory; the approach 
of fuzzy sets replaces the crisp boundary of a classical set by a gradual transition 
zone and permits partial and multiple set membership. For fuzzy regions, differ- 
ent views give a better understanding of their nature and also demonstrate how 
these objects can be represented as (collections of) crisp regions. Consequently, 
the current exact object models for crisp spatial objects can be considered as 
simplified special cases of a richer class of models for general spatial objects. It 
turns out that this is exactly the case for the model to be presented. 

Section 2 explains different aspects of spatial vagueness and presents related
work. Section 3 introduces some basic definitions of fuzzy set theory and fuzzy
topology as far as they are needed in this paper. Sections 4, 5, and 6 formally
define fuzzy points, fuzzy lines, and fuzzy regions, respectively. Since the definition
for fuzzy regions does not expose their geometric structure, Section 7 provides
several structured views of fuzzy regions based on collections of crisp regions.
Section 8 draws some conclusions and gives a prospect of future research.

2 Aspects of Spatial Vagueness and Related Work 

In current spatial data modeling, the entity-oriented view of spatial phenom- 
ena, which we will take in this paper, considers determinate spatial objects as 
conceptual and mathematical abstractions of real-world entities which can be 
identified and distinguished from the rest of space. For example, a crisp region 
partitions space into an interior, a boundary, and an exterior part which are
mutually exclusive and cover the whole space. Hence, the notion of a crisp region
is intrinsically related to the notion of a boundary. This view fits very well with 
the mathematical concepts given by the Jordan Curve Theorem and ordinary 
point set topology. 

Boundaries are considered as sharp lines that represent abrupt changes of 
spatial phenomena and that describe and thereby distinguish regions with dif- 
ferent characteristic features. The assumption of crisp boundaries harmonizes 
very well with the internal representation and processing of spatial objects in 
a computer which requires precise and unique internal structures. Hence, in 
the past, there has been a strong tendency to force reality into crisp objects. 
In practice, however, there is no apparent reason for the whole boundary of a 
region to be determined. There are a lot of geographical application examples 
illustrating that the boundaries of spatial objects can be partially or totally inde- 
terminate or blurred. For instance, boundaries of geological, soil, and vegetation 
units [Alt94, Bur96, KV91, LAB96] are often sharp in some places and vague in
others; many human concepts like “the Indian Ocean” are implicitly vague. 

In the real world, there are essentially two categories of indeterminate bound- 
aries: sharp boundaries whose position and shape are unknown or cannot be 
measured precisely, and boundaries which are not well-defined or which are use- 
less (e.g., between a mountain and a valley) and where essentially the topological 
relationship between spatial objects is of interest. According to these two cat- 
egories, mainly two kinds of spatial vagueness can be identified: uncertainty 
and fuzziness. Uncertainty is traditionally equated with randomness and chance 
occurrence and relates either to a lack of knowledge about the position and 
shape of an object with an existing, real boundary (positional uncertainty) or to 
the inability of measuring such an object precisely (measurement uncertainty). 
Fuzziness is an intrinsic feature of an object itself and describes the vagueness 
of an object which certainly has an extent but which inherently cannot or does 
not have a precisely definable boundary. 

The subject of modeling spatial vagueness has so far been predominantly 
treated by geographers but rather neglected by computer scientists. At least 
three alternatives are proposed as general design methods: (1) exact models 
[CF96, CG96, ES97b, Sch96] which transfer type systems and concepts for spatial
objects with sharp boundaries to objects with unclear boundaries and which
model both uncertainty and fuzziness but in a restricted way, (2) probabilistic 
models [Bla84, Bur96, Fin93, Shi93] which are based on probability theory and
predominantly model positional and measurement uncertainty, and (3) fuzzy 
models mM iHjjHnr mm irxr^ irCTii mKfm 

which are all based on fuzzy set theory and predominantly model fuzziness. 

The exact object model approach profits from existing definitions, techniques, 
data structures, algorithms, etc. which need not be redeveloped but only mod- 
ified and extended. Except for the approaches are based on some kind 

of zone concept. Vague boundaries of a region are modeled as zones expressing 
the minimal and maximal possible extent of a region. Vague regions P97E! 
are a generalization of these models. A vague region is defined as a pair of
disjoint, crisp regions. The first region called the kernel describes the area which
definitely and always belongs to the vague region. The second region called the 
boundary describes the area for which we cannot say with any certainty whether 
it or parts of it belong to the vague region or not. Maybe it is the case, maybe it 
is not. Or we could say that this is unknown. Vague regions are based on a three- 
valued logic, and boundaries need not necessarily be one-dimensional structures 
but can be regions. 
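The kernel/boundary idea can be sketched as follows. This is a hypothetical illustration, not the cited model's implementation: crisp regions are approximated by finite sets of grid cells, and the names `VagueRegion` and `Tri` are invented here. The sketch only shows how a pair of disjoint crisp regions induces a three-valued membership test.

```python
from enum import Enum


class Tri(Enum):
    """Three truth values of the underlying three-valued logic."""
    TRUE = "true"
    MAYBE = "maybe"
    FALSE = "false"


class VagueRegion:
    """A vague region as a pair of disjoint crisp regions (kernel, boundary).

    Assumption: each crisp region is approximated by a set of grid cells.
    """

    def __init__(self, kernel, boundary):
        kernel, boundary = frozenset(kernel), frozenset(boundary)
        if kernel & boundary:
            raise ValueError("kernel and boundary must be disjoint")
        self.kernel = kernel
        self.boundary = boundary

    def contains(self, cell):
        if cell in self.kernel:
            return Tri.TRUE    # definitely and always part of the region
        if cell in self.boundary:
            return Tri.MAYBE   # unknown whether the cell belongs to the region
        return Tri.FALSE       # definitely outside


region = VagueRegion(kernel={(0, 0), (0, 1)}, boundary={(1, 0), (1, 1)})
```

Note that the boundary here is itself a set of cells, i.e. a region, which matches the remark that boundaries need not be one-dimensional structures.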

Probability theory is able to represent uncertainty and defines the member- 
ship grade of an entity in a set by a statistically defined probability function. 
It deals with the expectation of a future event, based on something known now. 
Examples are the uncertainty about the spatial extent of regions defined by some 
property such as temperature, or the water level of a lake. 

Fuzzy set theory deals only with fuzziness. It describes the admission of 
the possibility (given by a so-called membership function) that an individual 
is a member of a set or that a given statement is true. Hence, the vagueness 
represented by fuzziness is not the uncertainty of expectation. It is the vagueness 
resulting from the imprecision of meaning of a concept. Examples of fuzzy spatial 
objects include mountains, valleys, biotopes, oceans, and many other geographic 
features which cannot be rigorously bounded by a sharp line. 

Another difference between fuzzy set theory and probability theory is that 
in the first case the possibility that an individual belongs to a set depends on 
subjective factors (e.g., expert knowledge) whereas in the second case probability 
can be computed formally or determined empirically and is thus more objective. 
Moreover, fuzzy set theory enables vague statements about one concrete ob- 
ject whereas probability theory makes statements about a collection of objects 
from which one is selected. Hence, fuzzy set theory models local vagueness while 
probability theory models global vagueness. 

The only proposal of a fuzzy data type relates to fuzzy regions |TFm1 defined 
as a fuzzy set over $\mathbb{N}^2$. Each coordinate $(x, y) \in \mathbb{N}^2$ is associated with a value
between 0 and 1 and describes the concentration of some feature attribute at 
that point. Unfortunately, the simple set property is insufficient since geometric 
anomalies can arise, as we will see later. The possible importance of fuzzy sets for 
geographical applications is demonstrated in [Bur96, LAB96, Use96] where also
examples of application-specific membership functions are given. The benefits
of fuzzy set theory for approximate spatial reasoning and fuzzy query languages
are shown in [Dut89, Dut91, KV91, Wan94]. [WHS90] models fuzzy objects by
means of the relational data model.



3 Fuzzy Sets and Fuzzy Topology 

Crisp regions have been formally defined on the basis of point sets and point set 
topology (e.g., [ES97b, Gaa64, Sch97]) which mainly rest on the set operations of
union, intersection, and difference. In a straightforward way we will now describe 
extensions of these two concepts to fuzzy set theory and fuzzy topology. 
Fuzzy set theory is an extension and generalization of Boolean set
theory. Let X be a classical (crisp) set of objects, called the universe (of
discourse). Membership in a classical subset A of X can then be described by the
characteristic function $\chi_A : X \to \{0, 1\}$ such that for all $x \in X$ holds:

$$\chi_A(x) = \begin{cases} 1 & \text{if and only if } x \in A \\ 0 & \text{if and only if } x \notin A \end{cases}$$

This function, which discriminates sharply between members and non-members
of a set, can be generalized such that all elements of X are mapped to
the real interval [0, 1] indicating the degree of membership of these elements in the
set in question. Hence, fuzzy set theory permits an element to have partial and
multiple membership. Larger values designate higher grades of set membership.
Let X again be the universe. Then

$$\mu_{\tilde{A}} : X \to [0, 1]$$

is called the membership function of $\tilde{A}$, and the set

$$\tilde{A} = \{(x, \mu_{\tilde{A}}(x)) \mid x \in X\}$$

is called a fuzzy set in X. All elements of X receive a valuation with respect
to their membership in $\tilde{A}$. Those elements $x \in X$ that in the classical sense do
not belong to $\tilde{A}$ get the membership value $\mu_{\tilde{A}}(x) = 0$; elements $x \in X$ that
completely belong to $\tilde{A}$ get the membership value $\mu_{\tilde{A}}(x) = 1$.

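There are many ways of extending the set inclusion as well as the basic crisp
set operations to fuzzy sets. We will comply with the definitions in [Zad65]. Let
$\tilde{A}$ and $\tilde{B}$ be fuzzy sets in X. Then

(i) $\neg\tilde{A} = \{(x, \mu_{\neg\tilde{A}}(x)) \mid x \in X \wedge \mu_{\neg\tilde{A}}(x) = 1 - \mu_{\tilde{A}}(x)\}$

(ii) $\tilde{A} \subseteq \tilde{B} :\Leftrightarrow \forall\, x \in X : \mu_{\tilde{A}}(x) \le \mu_{\tilde{B}}(x)$

(iii) $\tilde{A} \cap \tilde{B} = \{(x, \mu_{\tilde{A} \cap \tilde{B}}(x)) \mid x \in X \wedge \mu_{\tilde{A} \cap \tilde{B}}(x) = \min(\mu_{\tilde{A}}(x), \mu_{\tilde{B}}(x))\}$

(iv) $\tilde{A} \cup \tilde{B} = \{(x, \mu_{\tilde{A} \cup \tilde{B}}(x)) \mid x \in X \wedge \mu_{\tilde{A} \cup \tilde{B}}(x) = \max(\mu_{\tilde{A}}(x), \mu_{\tilde{B}}(x))\}$

(v) $\tilde{A} - \tilde{B} = \tilde{A} \cap \neg\tilde{B}$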
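The five operations listed above translate almost literally into code when the universe is finite. The following Python sketch assumes, purely for illustration, that a fuzzy set is a dictionary from elements to membership values and that missing elements have membership 0; the function names and the sample sets `A` and `B` are not from the paper.

```python
def complement(a, universe):
    """Membership 1 - mu_A(x) for every x in the universe."""
    return {x: 1.0 - a.get(x, 0.0) for x in universe}


def is_subset(a, b, universe):
    """A is included in B iff mu_A(x) <= mu_B(x) for all x."""
    return all(a.get(x, 0.0) <= b.get(x, 0.0) for x in universe)


def intersection(a, b, universe):
    """Pointwise minimum of the two membership functions."""
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in universe}


def union(a, b, universe):
    """Pointwise maximum of the two membership functions."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in universe}


def difference(a, b, universe):
    """A - B is defined as the intersection of A with the complement of B."""
    return intersection(a, complement(b, universe), universe)


universe = ["x1", "x2", "x3"]
A = {"x1": 1.0, "x2": 0.5, "x3": 0.0}
B = {"x1": 0.25, "x2": 0.75, "x3": 0.0}
```

For these sample sets, `difference(A, B, universe)` assigns 0.75 to `x1`, since `x1` fully belongs to A and belongs to the complement of B with degree 0.75.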

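A [strict] α-cut or [strict] α-level set of a fuzzy set $\tilde{A}$ for a specified value α
is the crisp set

$$A_\alpha\ [A^*_\alpha] = \{x \in X \mid \mu_{\tilde{A}}(x) \ge [>]\ \alpha \wedge 0 \le \alpha \le [<]\ 1\}$$

The strict α-cut for α = 0 is called support of $\tilde{A}$, i.e., $supp(\tilde{A}) = A^*_0$. For a
fuzzy set $\tilde{A}$ and $\alpha, \beta \in [0, 1]$ holds

(i) $X = A_0$

(ii) $\alpha < \beta \Rightarrow A_\alpha \supseteq A_\beta$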
The set of all levels $\alpha \in [0, 1]$ that represent distinct α-cuts of a given fuzzy
set $\tilde{A}$ is called the level set $\Lambda_{\tilde{A}}$ of $\tilde{A}$:

$$\Lambda_{\tilde{A}} = \{\alpha \in [0, 1] \mid \exists\, x \in X : \mu_{\tilde{A}}(x) = \alpha\}$$
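As a small illustration of α-cuts, support, and level set, consider the following Python sketch. It assumes a finite universe and a fuzzy set given as a dictionary that lists every element of the universe explicitly (including those with membership 0); the function names are hypothetical.

```python
def alpha_cut(a, alpha, strict=False):
    """Crisp set of elements with membership >= alpha (or > alpha if strict)."""
    if strict:
        if not 0.0 <= alpha < 1.0:
            raise ValueError("a strict alpha-cut requires 0 <= alpha < 1")
        return {x for x, m in a.items() if m > alpha}
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("an alpha-cut requires 0 <= alpha <= 1")
    return {x for x, m in a.items() if m >= alpha}


def support(a):
    """The support is the strict alpha-cut for alpha = 0."""
    return alpha_cut(a, 0.0, strict=True)


def level_set(a):
    """All membership values that actually occur in the fuzzy set."""
    return set(a.values())


A = {"x1": 1.0, "x2": 0.5, "x3": 0.0}
```

The non-strict cut at α = 0 returns the whole universe, and cuts shrink as α grows, which mirrors the two properties stated for α-cuts.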

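Fuzzy (point set) topology is a straightforward extension and
generalization of ordinary point set topology and allows one to distinguish specific
topological structures of a fuzzy set like its closure or interior.

A fuzzy topology on a universe X is a family $\tilde{T}$ of fuzzy sets in X satisfying
the following conditions:

(i) $X \in \tilde{T}$, $\emptyset \in \tilde{T}$

(ii) $\tilde{A} \in \tilde{T}, \tilde{B} \in \tilde{T} \Rightarrow \tilde{A} \cap \tilde{B} \in \tilde{T}$

(iii) $\tilde{S} \subseteq \tilde{T} \Rightarrow \bigcup_{\tilde{A} \in \tilde{S}} \tilde{A} \in \tilde{T}$

The pair $(X, \tilde{T})$ is said to be a fuzzy topological space. The elements of $\tilde{T}$ are
called open fuzzy sets. Note that X and $\emptyset$ are crisp sets and simultaneously special
fuzzy sets. X corresponds to the fuzzy set $\tilde{X} = \{(x, \mu_{\tilde{X}}(x)) \mid x \in X \wedge \mu_{\tilde{X}}(x) = 1\}$.
The empty set $\emptyset$ corresponds to the empty fuzzy set $\tilde{\emptyset} = \{(x, \mu_{\tilde{\emptyset}}(x)) \mid x \in X \wedge \mu_{\tilde{\emptyset}}(x) = 0\}$.
We will identify X and $\tilde{X}$ as well as $\emptyset$ and $\tilde{\emptyset}$ and use the crisp
notations for these two sets.

The family $\tilde{T}'$ of all closed fuzzy sets in a fuzzy topological space $(X, \tilde{T})$ is
given by

$$\tilde{T}' = \{\neg\tilde{A} \mid \tilde{A} \in \tilde{T}\}$$

The closure [interior] of a fuzzy set $\tilde{A}$ in a fuzzy topological space $(X, \tilde{T})$ is
the smallest closed [largest open] fuzzy set containing $\tilde{A}$ [contained in $\tilde{A}$], i.e.,

$$cl_{\tilde{T}}(\tilde{A}) = \bigcap \{\tilde{S} \mid \tilde{S} \in \tilde{T}' \wedge \tilde{A} \subseteq \tilde{S}\}$$

$$[int_{\tilde{T}}(\tilde{A}) = \bigcup \{\tilde{S} \mid \tilde{S} \in \tilde{T} \wedge \tilde{S} \subseteq \tilde{A}\}]$$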



4 Fuzzy Points 



Due to our assumption of the point set paradigm, an understanding of the nature
of a point, or more precisely a fuzzy point, is necessary. There are at least two
meaningful definitions for a fuzzy point.

The first definition views a fuzzy point as a point in two-dimensional
Euclidean space with a membership value greater than 0, since 0 documents the
non-existence of a point. A fuzzy point $\tilde{p}$ at (a, b) in $\mathbb{R}^2$, written $\tilde{p}(a, b)$, is a
fuzzy singleton in $\mathbb{R}^2$ defined by

$$\mu_{\tilde{p}(a,b)}(x, y) = \begin{cases} m & \text{if } (x, y) = (a, b) \\ 0 & \text{otherwise} \end{cases}$$

with $0 < m \le 1$. Point $\tilde{p}$ is said to have support (a, b) and value m. Let $P_f$ be the
set of all fuzzy points. $P_f$ is, of course, a proper superset of $P_c$, the set of all crisp
points in $\mathbb{R}^2$. For $\tilde{p} = p = (a, b) \in P_c$, we obtain $\mu_{\tilde{p}(a,b)}(x, y) = \chi_{p(a,b)}(x, y) = 1$ if
$(x, y) = (a, b)$, and 0 otherwise.

The second definition uses a membership function that returns the degree of
proximity of a point to a reference point p. That is, we consider that the point
(x, y) is “approximately (a, b)” or “about (a, b)” to the degree $\mu_{\tilde{p}(a,b)}(x, y)$. A
fuzzy point $\tilde{p}(a, b)$ is then generally defined by

(i) $\mu_{\tilde{p}(a,b)}$ is upper semicontinuous

(ii) $\mu_{\tilde{p}(a,b)}(x, y) = 1$ if and only if $(x, y) = (a, b)$

(iii) $\forall\, 0 \le \alpha \le 1 : \tilde{p}_\alpha$ is a convex subset of $\mathbb{R}^2$

The concrete “distance-based” membership function

with $c \in \mathbb{R}^+$, $c > 1$, and $\lambda > 0$ illustrates this definition. The degree of proximity
decreases as (x, y) moves further away from (a, b). It reaches 1 if (x, y) = (a, b).

Unfortunately, this membership function with unbounded support is difficult
to represent. Alternatively, we can employ the following, restricted but more
practical function which defines a circle around (a, b) with radius $r \in \mathbb{R}^+$:

$$\mu_{\tilde{p}(a,b)}(x, y) = \begin{cases} 1 - \dfrac{\sqrt{(x-a)^2 + (y-b)^2}}{r} & \text{if } (x-a)^2 + (y-b)^2 \le r^2 \\ 0 & \text{otherwise} \end{cases}$$
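Both definitions of a fuzzy point can be sketched as membership functions in Python. The sketch is hedged in two ways: the function names are hypothetical, and the restricted function is read as 1 minus the distance to (a, b) divided by r inside the circle, which is the reading under which the value 1 is reached exactly at (a, b).

```python
import math


def singleton_membership(a, b, m):
    """First definition: value m at the support (a, b), 0 everywhere else."""
    if not 0.0 < m <= 1.0:
        raise ValueError("the value m must satisfy 0 < m <= 1")

    def mu(x, y):
        return m if (x, y) == (a, b) else 0.0

    return mu


def circle_membership(a, b, r):
    """Second definition, restricted form: linear decay inside radius r."""
    if r <= 0:
        raise ValueError("the radius r must be positive")

    def mu(x, y):
        d = math.hypot(x - a, y - b)  # Euclidean distance to (a, b)
        return 1.0 - d / r if d <= r else 0.0

    return mu


about_origin = circle_membership(0.0, 0.0, 2.0)
point_value = singleton_membership(1.0, 2.0, 0.75)
```

With radius 2, a location at distance 1 from the reference point is “about (a, b)” to the degree 0.5, and every location on or outside the circle has degree 0.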



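Next, we define three geometric primitives on fuzzy points which are valid
for both definitions of fuzzy points. Let $\tilde{p}(a, b), \tilde{q}(c, d) \in P_f$ with $a, b, c, d \in \mathbb{R}$.
Then

(i) $\tilde{p}(a, b) = \tilde{q}(c, d) :\Leftrightarrow a = c \wedge b = d \wedge \mu_{\tilde{p}(a,b)} = \mu_{\tilde{q}(c,d)}$

(ii) $\tilde{p}(a, b) \ne \tilde{q}(c, d) :\Leftrightarrow \neg(\tilde{p}(a, b) = \tilde{q}(c, d))$

(iii) $\tilde{p}(a, b)$ and $\tilde{q}(c, d)$ are disjoint $:\Leftrightarrow supp(\tilde{p}(a, b)) \cap supp(\tilde{q}(c, d)) = \emptyset$

In contrast to crisp points, for fuzzy points we also have a predicate for
disjointedness. We are now able to define an object of the fuzzy spatial data
type fpoint as a set of disjoint fuzzy points:

$$fpoint = \{Q \subseteq P_f \mid \forall\, \tilde{p} \ne \tilde{q} \in Q : \tilde{p}(a, b) \text{ and } \tilde{q}(c, d) \text{ are disjoint} \wedge Q \text{ is finite}\}$$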
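A value of type fpoint can be sketched as a finite set of pairwise disjoint fuzzy points. The following Python code is only an illustration under the first (singleton) definition of a fuzzy point, where the support is the single location (a, b); `FuzzyPoint`, `disjoint`, and `make_fpoint` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyPoint:
    a: float
    b: float
    m: float  # membership value at (a, b), with 0 < m <= 1

    def __post_init__(self):
        if not 0.0 < self.m <= 1.0:
            raise ValueError("the value m must satisfy 0 < m <= 1")


def disjoint(p, q):
    """Singleton supports are disjoint iff the locations differ."""
    return (p.a, p.b) != (q.a, q.b)


def make_fpoint(points):
    """Build an fpoint value: a finite set of pairwise disjoint fuzzy points."""
    pts = frozenset(points)
    seen = set()
    for p in pts:
        if (p.a, p.b) in seen:
            raise ValueError("two distinct fuzzy points share the support (a, b)")
        seen.add((p.a, p.b))
    return pts


cities = make_fpoint([FuzzyPoint(0.0, 0.0, 0.5), FuzzyPoint(1.0, 1.0, 1.0)])
```

The generated equality of the dataclass compares location and value, which corresponds to primitive (i) for singleton points; two points at the same location with different values are unequal but not disjoint, so they cannot occur in the same fpoint value.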



5 Fuzzy Lines 

In this section we define a data type for fuzzy lines. For that, we first introduce a
simple fuzzy line as a continuous curve with smooth transitions of membership
grades between neighboring points of the line. We assume a total order on $\mathbb{R}^2$
which is given by the lexicographic order “<” on the coordinates (first x, then
y). The membership function of a simple fuzzy line $\tilde{l}$ is then defined by

$$\mu_{\tilde{l}} : f_{\tilde{l}} \to [0, 1] \quad \text{with} \quad f_{\tilde{l}} : [0, 1] \to \mathbb{R}^2$$

such that

(i) $\mu_{\tilde{l}}$ is continuous

(ii) $f_{\tilde{l}}$ is continuous

(iii) $\forall\, a, b \in (0, 1) : a \ne b \Rightarrow f_{\tilde{l}}(a) \ne f_{\tilde{l}}(b)$

(iv) $\forall\, a \in \{0, 1\}\ \forall\, b \in (0, 1) : f_{\tilde{l}}(a) \ne f_{\tilde{l}}(b)$

(v) $f_{\tilde{l}}(0) < f_{\tilde{l}}(1) \vee (f_{\tilde{l}}(0) = f_{\tilde{l}}(1) \wedge \forall\, a \in (0, 1) : f_{\tilde{l}}(0) < f_{\tilde{l}}(a))$

Footnote: A function $f : X \to \mathbb{R}$ is upper semicontinuous $:\Leftrightarrow \forall\, r \in \mathbb{R} : \{x \mid f(x) < r\}$ is open.

Footnote: A set $X \subseteq \mathbb{R}^2$ is called convex $:\Leftrightarrow \forall\, p, q \in X\ \forall\, \lambda \in \mathbb{R}^+$ with $0 < \lambda < 1 : r = \lambda p + (1 - \lambda) q \in X$ (p, q, and r are here regarded as vectors).

Function $f_{\tilde{l}}$ on its own models a continuous, simple crisp line (a simple
curve). The points $f_{\tilde{l}}(0)$ and $f_{\tilde{l}}(1)$ are called the end points of $f_{\tilde{l}}$. The definition
allows loops ($f_{\tilde{l}}(0) = f_{\tilde{l}}(1)$) but prohibits equality of interior points and
thus self-intersections (condition (iii)) and equality of an interior with an end
point (condition (iv)). The last condition ensures uniqueness of representation,
i.e., in a closed simple line $f_{\tilde{l}}(0)$ must be the leftmost point.

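Let S be the set of fuzzy simple lines, and let $T \subseteq S$. An S-complex C over
T is a finite set $C = \{\tilde{l}_1, \ldots, \tilde{l}_n\} \subseteq T$ such that

(i) $\forall\, 1 \le i < j \le n : f_{\tilde{l}_i}((0, 1)) \cap f_{\tilde{l}_j}((0, 1)) = \emptyset$

(ii) $\forall\, 1 \le i < j \le n : \{f_{\tilde{l}_i}(0), f_{\tilde{l}_i}(1)\} \cap f_{\tilde{l}_j}((0, 1)) = \emptyset$

(iii) $\forall\, 1 \le i \le n\ \exists\, 1 \le j \le n, j \ne i : \{f_{\tilde{l}_i}(0), f_{\tilde{l}_i}(1)\} \cap \{f_{\tilde{l}_j}(0), f_{\tilde{l}_j}(1)\} \ne \emptyset$

(iv) For all $1 \le i, j \le n$ and for all $a, k \in \{0, 1\}$ let $V^a_{\tilde{l}_i} = \{(j, k) \mid f_{\tilde{l}_i}(a) = f_{\tilde{l}_j}(k)\}$.
Then we require: $\forall\, 1 \le i \le n\ \forall\, a \in \{0, 1\} : (|V^a_{\tilde{l}_i}| = 1) \vee (|V^a_{\tilde{l}_i}| > 2)$

(v) $\forall\, 1 \le i \le n\ \forall\, a \in \{0, 1\}\ \forall\, (j, k) \in V^a_{\tilde{l}_i} : \mu_{\tilde{l}_i}(f_{\tilde{l}_i}(a)) = \mu_{\tilde{l}_j}(f_{\tilde{l}_j}(k))$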

Condition (i) requires that the elements of an S-complex do not intersect or
overlap within their interior. Moreover, they may not be touched within their
interior by an endpoint of another element (condition (ii)). Condition (iii) ensures
the property of connectivity of an S-complex; isolated fuzzy simple lines are
disallowed. Condition (iv) expresses that each endpoint of an element of C must
belong to exactly one or more than two incident elements of C (note that always
$(i, a) \in V^a_{\tilde{l}_i}$). This condition supports the requirement of maximal elements and
hence achieves minimality of representation. Condition (v) requires that
more than two elements of C with a common end point
have the same membership value at that point; otherwise we get a contradiction saying
that a point of an S-complex has more than one membership value.

All conditions together define an S-complex over T as a connected planar
fuzzy graph with a unique representation. The corresponding point set of C is
$points(C) = \bigcup_{i=1}^{n} f_{\tilde{l}_i}([0, 1])$. The set of all S-complexes over T is denoted by
SC(T). The disjointedness of any two S-complexes $C_1, C_2 \in SC(T)$ is defined
as follows:

$C_1$ and $C_2$ are disjoint $:\Leftrightarrow points(C_1) \cap points(C_2) = \emptyset$

A fuzzy spatial data type for fuzzy lines called fline can now be defined in
two equivalent ways. The “structured view” is based on S-complexes:

$$fline = \{D \subseteq SC(S) \mid \forall\, C_1, C_2 \in D : C_1 \text{ and } C_2 \text{ are disjoint} \wedge D \text{ is finite}\}$$

Footnote: The application of a function f to a set X of values is defined as $f(X) = \{f(x) \mid x \in X\}$.
Let Ir = X {!}. The “flat view” emphasizing the point set paradigm is:

fline = {Q C | 3 C SC{S) : points{C) = supp{Q)} 

CeD 



6 Fuzzy Regions 

The aim of this section is to develop and formalize the concept of a fuzzy region. 
Section 6.1 informally discusses the intrinsic features of fuzzy regions, classifies
them, gives application examples for them, and compares them to classical
crisp regions. After this motivation, Section 6.2 provides their formal definition.
Finally, Section 6.3 gives examples of possible membership functions for them.

6.1 What Are Fuzzy Regions? 

The question what a crisp region is has been treated in many publications. A 
very general definition defines a crisp region as a set of disjoint, connected areal 
components, called faces, possibly with disjoint holes ^SazEl Esnsj EcEazI in 
the Euclidean space $\mathbb{R}^2$. This model has the nice property that it is closed under
(appropriately defined) geometric union, intersection, and difference operations. 
It allows crisp regions to contain holes and islands within holes to any finite 
level. 

By analogy with the generalization of crisp sets to fuzzy sets, we strive for 
a generalization of crisp regions to fuzzy regions on the basis of the point set 
paradigm and fuzzy concepts. At the same time we would like to transfer the 
structural definition of crisp regions (i.e., the component view) to fuzzy regions. 
Thus, a fuzzy region is supposed to have the same structure as a crisp
region, with one generalization: the strict principle that a point in space either
belongs or does not belong to a specific region is relaxed, which gives greater
flexibility and enables a partial membership of
a point in a region. This is just what the term “fuzzy” means here.

There are at least three possible, related interpretations for a point in a fuzzy 
region. First, this situation may be interpreted as the degree of belonging to which 
that point is inside or part of some areal feature. Consider the transition be- 
tween a mountain and a valley and the problem to decide which points have to 
be assigned to the valley and which points to the mountain. Obviously, there is 
no strict boundary between them, and it seems to be more appropriate to model 
the transition by partial and multiple membership. Second, this situation may 
indicate the degree of compatibility of the individual point with the attribute or 
concept represented by the fuzzy region. An example is “warm areas” where
we must decide for each point whether and to which grade it corresponds to the 
concept “warm”. Third, this situation may be viewed as the degree of concentra- 
tion of some attribute associated with the fuzzy region at the particular point. 
An example is air pollution where we can assume the highest concentration at 
power stations, for instance, and lower concentrations with increasing distance 
from them. All these related interpretations give evidence of fuzziness. 
When dealing with crisp regions, the user usually does not employ point sets
as a method to conceptualize space. The user rather thinks in terms of sharply 
determined boundaries enclosing and grouping areas with equal properties or 
attributes and separating different regions with different properties from each 
other; he or she has purely qualitative concepts in mind. This view changes when 
fuzzy regions come into play. Besides the qualitative aspect, in particular the 
quantitative aspect becomes important, and boundaries in most cases disappear 
(between a valley and a mountain there is no strict boundary!). The distribution 
of attribute values within a region and transitions between different regions may 
be smooth or continuous. This feature just characterizes fuzzy regions. 

We now give a classification of fuzzy regions from an application point of view. 
The classification extends from fuzzy regions with highest vagueness and lowest 
gradation of attribute values to fuzzy regions with lowest vagueness and highest 
gradation of attribute values. The given application examples are basically valid 
for each class. How to model areal features as fuzzy regions depends on the 
application and on the “preciseness” and quality of information. 



Core-Boundary Fuzzy Regions. If there is only insufficient knowledge about 
the grade of indeterminacy of the vague parts of a region, a first approach is 
to differentiate between its core, its boundary, and its exterior which relate to 
those parts that definitely belong, perhaps belong, and definitely do not belong, 
respectively, to the region. This extension corresponds exactly to the approach of 
vague regions where core and boundary are modeled by crisp regions. It can 
also be simply modeled by a fuzzy region by assigning the membership function 
value 1 to each point of the core, value 0 to each point of the exterior, and 
value 1/2 (halfway between completely true and completely false) to each point 
of the boundary. It is important to note that a boundary in this sense can be a 
region and thus has a different and generalized meaning compared to traditional, 
crisp boundaries*. We will denote fuzzy regions based on a three-valued logic as 
core-boundary (fuzzy) regions. 

An application example is a lake which has a minimal water level in dry 
periods (core) and a maximal water level in rainy periods (boundary given as 
the difference between maximal and minimal water level). Dry periods can entail 
puddles. Small islands in the lake which are less flooded by water in dry and 
more (but never completely) flooded in rainy periods can be modeled through 
holes surrounded by a boundary. If an island like a sandbank can be flooded 
completely, it belongs to the boundary part. 



* Nevertheless, core, boundary, and exterior are separated from each other by ordinary, 
strict “boundaries” as we know them from ordinary point set topology. 

Finite-Valued Fuzzy Regions. The next step lifts the restriction of having 
only one degree of fuzziness. The introduction of different degrees leads from 
fuzzy regions based on a three-valued logic to fuzzy regions based on a 
finite-valued and thus multivalued logic. This enables us to describe more precisely the 
degree of membership of a point in a fuzzy region. The membership function 
value 3/4 (1/4) could express that it is mostly true (false) and only a little false 
(true) that a point is an element of a specific fuzzy region. We will call this kind 
of fuzzy regions finite-valued (fuzzy) regions. If $n \in \mathbb{N}$ is the number of possible 
“truth values”, an n-valued membership function turns out to be quite useful 
for representing a wide range of degrees to which a point belongs to a fuzzy region. 

An application example is given by regions with different likelihoods of virus 
infection. Regions could be categorized by n different risk levels extending from areas 
with extreme risk of infection over areas with average risk of infection to safe 
areas. 

The two classes of fuzzy regions described so far have a predominantly 
qualitative character. This means that the numbers involved in membership functions 
of a fuzzy region only play a symbolic role and that their size is of lower 
importance. Essentially, a total and bijective mapping is defined between n possible 
categories expressing different degrees of fuzziness and n discrete values out of 
the range [0, 1]. Although the selection of the n discrete values is arbitrary (they 
only need to be distinct from each other, and no order is needed between 
them), they are usually chosen in a way that agrees with our intuition. 



Interval-Based Fuzzy Regions. The following two classes emphasize a more 
quantitative character of fuzzy regions. Consider an ordered set of n arbitrary 
but disjoint values of the interval [0, 1] and the assignment of exactly one of these 
values, let us say, v, to all points of a specific connected component c (a face) of 
a fuzzy region. We can then interpret such a value v for all points of c as their 
guaranteed minimal degree of belonging to c. Hence, v represents a lower bound. 
Since the set of values is ordered, each value v (except for the highest value) has 
a successor w with respect to the defined order, i.e., v < w. This implies that no 
point of c can have a value greater than w, since otherwise these points would 
have to be labeled with the value w. This justifies implicitly mapping all points of 
c to the label [v, w], i.e., to a closed interval. The meaning is that the degree of 
membership of each point of c is somewhere between v and w (we do not have 
more information). We denote this kind of fuzzy regions as interval-based (fuzzy) 
regions. Each pair of the $n - 1$ possible intervals is either disjoint or adjacent 
with common bounds. All intervals together form a finite covering of the unit 
interval [0, 1]. 
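
The construction of interval labels can be illustrated with a small Python sketch. It is not taken from the paper: the function names `interval_labels` and `label_for` are hypothetical, and we assume that the chosen values include 0 and 1 so that the intervals cover the unit interval.

```python
from bisect import bisect_right


def interval_labels(values):
    """Return the closed intervals [v, w] formed by consecutive ordered values.

    n distinct values yield n - 1 intervals; adjacent intervals share a bound.
    """
    ordered = sorted(set(values))
    return list(zip(ordered, ordered[1:]))


def label_for(degree, values):
    """Return the interval (v, w) whose lower bound v is the largest value <= degree.

    Assumption of this sketch: a degree equal to the highest value is placed
    in the last interval, since the highest value has no successor.
    """
    ordered = sorted(set(values))
    if not ordered[0] <= degree <= ordered[-1]:
        raise ValueError("degree lies outside the covered range")
    # Index of the guaranteed lower bound, clamped to the last interval.
    i = min(bisect_right(ordered, degree) - 1, len(ordered) - 2)
    return (ordered[i], ordered[i + 1])


if __name__ == "__main__":
    chosen = [0.0, 0.25, 0.5, 1.0]
    print(interval_labels(chosen))
    print(label_for(0.3, chosen))
```

With four values the sketch produces three intervals, and every membership degree in [0, 1] falls into exactly one label as selected by its lower bound.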

An application example is a map of the population density of a country. 
According to a predefined interval classification, the country is subdivided into 
regions showing the minimal guaranteed population density per km² for each 
region. The density values of different regions can differ considerably. Another 
example is given by weather maps on television, which usually show single reference 
temperatures as sample data spread over the map and representing temperature 
zones. Here we assume that a direct path from a lower to a higher reference 
temperature is accompanied by smoothly increasing temperatures, so that 
transitions between different regions are smooth. 
Smooth Fuzzy Regions. A last and very important class of fuzzy regions, 
which has so far not been treated in the literature, takes advantage of 
available knowledge about the distribution of attribute values within a fuzzy region. 
This knowledge can be gained by an expert through appropriate membership 
functions. We require that the distribution of attribute values within a fuzzy 
region is smooth (with a finite number of exceptions). This can be achieved by 
so-called predominantly continuous membership functions. We call this kind of 
fuzzy regions predominantly smooth (fuzzy) regions. As a special case we obtain 
(totally) smooth (fuzzy) regions with no continuity gaps. 

Many spatial phenomena show smooth behavior. Application 
examples are air pollution (Figure 1), temperature zones, magnetic fields, storm 
intensity, and solar insolation. Predominantly smooth regions are the most general 
class of fuzzy regions and comprise all other aforementioned classes, which are 
obviously (predominantly) continuous. In particular, this means that combinations 
of different classes are possible without any problems. 





Fig. 1. This figure demonstrates a possible visualization of a fuzzy region which 
could model the expansion of air pollution caused by a power station. The left 
image shows a radial expansion where the degree of pollution concentrates in the 
center (darker locations) and decreases with increasing distance from the power 
station (brighter locations). The right image has the same theme but this time 
we imagine that the power station is surrounded by high mountains to the north, 
the south, and the west. Hence, the pollution cannot escape in these directions 
and finds its way out of the valley in an easterly direction. In both cases we can 
recognize the smooth transitions to the exterior. 



6.2 Formal Definition of Fuzzy Regions 

Since our objective is to model two-dimensional fuzzy areal objects for spatial 
applications, we consider a fuzzy topology T on the Euclidean space (plane) 
$\mathbb{R}^2$. In this spatial context we denote the elements of T as fuzzy point sets. The 
membership function for a fuzzy point set A in the plane is then described by 
$\mu_A : \mathbb{R}^2 \to [0, 1]$. 

From an application point of view, there are two observations that prevent 
a definition of a fuzzy region simply as a fuzzy point set. We will discuss them 
now in more detail and at the same time elaborate properties of fuzzy regions. 

Avoiding Geometric Anomalies: Regularization. The first observation 
refers to a necessary regularization of fuzzy point sets. The first reason for this 
measure is that fuzzy (as well as crisp) regions that actually appear in spatial 
applications in most cases cannot be just modeled as arbitrary point sets but 
have to be represented as point sets that do not have “geometric anomalies” 
and that are in a certain sense regular. Geometric anomalies relate to isolated 
or dangling line or point features and missing lines and points in the form of 
cuts and punctures. Spatial phenomena with such degeneracies never appear as 
entities in reality. The second reason is that, from a data type point of view, 
we are interested in fuzzy spatial data types that satisfy closure properties for 
(appropriately defined) geometric union, intersection, and difference. 

We are, of course, confronted with the same problem in the crisp case, where 
the problem can be avoided by the concept of regularity [ES97b, Sch97, Til80]. 
It turns out to be useful to appropriately transfer this concept to the fuzzy case. 
Let A be a fuzzy set of a fuzzy topological space $(\mathbb{R}^2, T)$. Then 

A is called a regular open fuzzy set if $A = \mathit{int}_f(\mathit{cl}_f(A))$ 

Whereas crisp regions are usually modeled as regular closed crisp sets, we 
will use regular open fuzzy sets due to their vagueness and their usual lack of 
boundaries. Regular open fuzzy sets avoid the aforementioned geometric anoma- 
lies, too. Since application examples show that fuzzy regions can also be partially 
bounded, we admit partial boundaries with a crisp or fuzzy character. For that 
purpose we define the following fuzzy set: 

$$\mathit{frontier}_f(A) := \{((x, y), \mu_A(x, y)) \mid (x, y) \in \mathit{supp}(A) - \mathit{supp}(\mathit{int}_f(A))\}$$ 

A fuzzy set A is now called a spatially regular fuzzy set iff 

(i) $\mathit{int}_f(A)$ is a regular open fuzzy set 

(ii) $\mathit{frontier}_f(A) \subseteq \mathit{frontier}_f(\mathit{cl}_f(\mathit{int}_f(A)))$ 

(iii) $\mathit{frontier}_f(A)$ is a partition of n connected boundary parts (fuzzy sets) 

We can conclude that $\mathit{frontier}_f(A) = \emptyset$ if A is regular open. We will base 
our definition of fuzzy regions on spatially regular fuzzy sets and define a 
regularization function $\mathit{reg}_f$ which associates the interior of a fuzzy set A with its 
corresponding regular open fuzzy set and which restricts the partial boundary 
of A (if it exists at all) to a part of the boundary of the corresponding regular 
closed fuzzy set of A: 

$$\mathit{reg}_f(A) := \mathit{int}_f(\mathit{cl}_f(A)) \cup (\mathit{frontier}_f(A) \cap \mathit{frontier}_f(\mathit{cl}_f(\mathit{int}_f(A))))$$ 

The different components of the regularization process work as follows: the 
interior operator $\mathit{int}_f$ eliminates dangling point and line features since their 
interior is empty. The closure operator $\mathit{cl}_f$ removes cuts and punctures by 
appropriately adding points. Furthermore, the closure operator introduces a fuzzy 
boundary (similar to a crisp boundary in the ordinary point-set topological sense) 
separating the points of a closed set from its exterior. The operator $\mathit{frontier}_f$ 
supports the restriction of the boundary. 

The following statements about set operations on regular open fuzzy sets are 
given informally and without proof. The intersection of two regular open fuzzy 
sets is regular open. The union, difference, and complement of regular open 
fuzzy sets are not necessarily regular open since they can produce anomalies. 
Correspondingly, this also holds for spatially regular fuzzy sets. Hence, we 
introduce regularized set operations on spatially regular fuzzy sets that preserve 
regularity. Let A, B be spatially regular fuzzy sets of a fuzzy topological space, 
and for membership values a and b let $a \dot{-} b = a - b$ for $a > b$ and 
$a \dot{-} b = 0$ otherwise. Then 

(i) $A \cup_r B := \mathit{reg}_f(A \cup B)$ 

(ii) $A \cap_r B := \mathit{reg}_f(A \cap B)$ 

(iii) $A -_r B := \mathit{reg}_f(\{((x, y), \mu_{A \dot{-} B}(x, y)) \mid (x, y) \in \mathbb{R}^2 \wedge \mu_{A \dot{-} B}(x, y) = \mu_A(x, y) \dot{-} \mu_B(x, y)\})$ 

(iv) $\neg_r A := \mathit{reg}_f(\neg A)$ 

Note that we have changed the meaning of difference (i.e., $A -_r B \neq A \cap_r \neg_r B$) 
since the right side of the inequality does not seem to make much sense 
in the spatial context. Regular open fuzzy sets, spatially regular fuzzy sets, 
and regularized set operations express a natural formalization of the desired 
dimension-preserving property of set operations. In the crisp case this is taken for 
granted but hardly ever fulfilled by spatial type systems, geometric algorithms, 
spatial database systems, and GIS. 
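
The distinction between the bounded difference and the intersection with a complement can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: fuzzy sets are represented as small raster grids of membership values, the complement and intersection are assumed to be the standard fuzzy operations (1 - b and min), and the regularization step $\mathit{reg}_f$ is omitted entirely. All function names are hypothetical.

```python
def bounded_difference(a, b):
    """Pointwise bounded difference: a - b if a > b, otherwise 0."""
    return a - b if a > b else 0.0


def fuzzy_difference(mu_a, mu_b):
    """Apply the bounded difference cell by cell to two membership grids."""
    return [[bounded_difference(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mu_a, mu_b)]


def intersect_with_complement(mu_a, mu_b):
    """Standard fuzzy A AND (NOT B): min(a, 1 - b), assumed here for comparison."""
    return [[min(a, 1.0 - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mu_a, mu_b)]


if __name__ == "__main__":
    mu_a = [[0.5, 1.0]]
    mu_b = [[0.5, 0.25]]
    print(fuzzy_difference(mu_a, mu_b))           # first cell becomes 0
    print(intersect_with_complement(mu_a, mu_b))  # first cell stays at 0.5
```

In the first cell both sets have the same membership value, so the bounded difference removes the point completely, whereas the intersection with the complement would leave a membership of 0.5.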

Whereas the subspace RCCS of regular closed crisp sets together with the 
crisp regular set operations “$\cup$” and “$\cap$” and the set-theoretic order relation 
“$\subseteq$” forms a Boolean lattice, this is not the case for SRFS, which denotes the 
subspace of spatially regular fuzzy sets. Here we obtain the (unproven but 
obvious) statement that SRFS together with the regularized set operations “$\cup_r$” and 
“$\cap_r$” and the fuzzy set-theoretic order relation “$\subseteq$” is a pseudo-complemented 
distributive lattice. 

This implies that (i) $(\mathit{SRFS}, \subseteq)$ is a partially ordered set (reflexivity, 
antisymmetry, transitivity), (ii) every pair A, B of elements of SRFS has a least 
upper bound $A \cup_r B$ and a greatest lower bound $A \cap_r B$, (iii) $(\mathit{SRFS}, \subseteq)$ has a 
maximal element $1_r := \{((x, y), \mu(x, y)) \mid (x, y) \in \mathbb{R}^2 \wedge \mu(x, y) = 1\}$ (identity of 
“$\cap_r$”) and a minimal element $0_r := \{((x, y), \mu(x, y)) \mid (x, y) \in \mathbb{R}^2 \wedge \mu(x, y) = 0\}$ 
(identity of “$\cup_r$”), and (iv) algebraic laws like idempotence, commutativity, 
associativity, absorption, and distributivity hold for “$\cup_r$” and “$\cap_r$”. 

$(\mathit{SRFS}, \subseteq)$ is not a complementary lattice. Although the algebraic laws of 
involution and dualization hold, this is not true for the laws of complementarity. 
If we take the standard fuzzy set operations presented earlier in this paper as a basis, the 
law of excluded middle $A \cup_r \neg A = 1_r$ and the law of contradiction $A \cap_r \neg A = 0_r$ 
do not hold in general. This fact explains the term “pseudo-complemented” from 
above and is no weakness of the model but only an indication of fuzziness. 
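
A minimal Python sketch shows why the two complementarity laws fail. It assumes that the standard fuzzy operations are max for union, min for intersection, and 1 - m for complement, and it ignores regularization; the helper names are hypothetical.

```python
def complement(mu):
    """Standard fuzzy complement, applied to a list of membership values."""
    return [1.0 - m for m in mu]


def union(mu_a, mu_b):
    """Standard fuzzy union: pointwise maximum."""
    return [max(a, b) for a, b in zip(mu_a, mu_b)]


def intersection(mu_a, mu_b):
    """Standard fuzzy intersection: pointwise minimum."""
    return [min(a, b) for a, b in zip(mu_a, mu_b)]


if __name__ == "__main__":
    mu = [0.0, 0.25, 0.5, 1.0]
    # Excluded middle would require all ones, contradiction all zeros.
    print(union(mu, complement(mu)))
    print(intersection(mu, complement(mu)))
```

Only the points with crisp membership (0 or 1) satisfy both laws; every point with a partial membership value violates them, which is exactly the indication of fuzziness mentioned above.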



Modeling Smooth Attribute Changes: Predominantly Continuous 
Membership Functions. The second observation is that, according to the 
application cases shown in Section 6.1, the mapping itself may not be 
arbitrary but must take into account the intrinsic smoothness of fuzzy regions. This 
property can be modeled by the well-known mathematical concept of continuity 
and results in special continuous membership functions for fuzzy regions. We 
say that a function f contains a continuity gap at a point $x_0$ of its domain if f 
is semicontinuous but not continuous at $x_0$. Function f is called predominantly 
continuous if f is continuous except for at most a finite number of continuity gaps. 



Defining Fuzzy Regions. The type fregion for fuzzy regions can now be 
defined in the following way: 

$$\mathit{fregion} = \{R \in \mathit{SRFS} \mid \mu_R \text{ is predominantly continuous}\}$$ 



6.3 Examples of Membership Functions for Fuzzy Regions 

In this section we give some simple examples of membership functions which fulfil 
the properties required in Section 6.2. The determination of suitable membership 
functions is the main difficulty in using the fuzzy set approach. Frequently, expert and 
empirical knowledge is necessary and used to design appropriate functions. We 
start with an example for a smooth fuzzy region. By taking a crisp region A 
with boundary $B_A$ as a reference object, we can construct a fuzzy region on the 
basis of the following distance-based membership function: 



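$$\mu_{\tilde{A}}(x, y) = \begin{cases} 1 & \text{if } (x, y) \in A \\ a^{-\lambda\, d((x, y), B_A)} & \text{if } (x, y) \notin A \end{cases}$$ 

where $a \in \mathbb{R}^+$ and $a > 1$, $\lambda \in \mathbb{R}^+$ is a constant, and $d((x, y), B_A)$ computes the 
distance between point $(x, y)$ and boundary $B_A$ in the following way: 

$$d((x, y), B_A) = \min\{\mathit{dist}((x, y), (x', y')) \mid (x', y') \in B_A\}$$ 

where $\mathit{dist}(p, q)$ is the usual Euclidean distance between two points $p, q \in \mathbb{R}^2$. 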
Unfortunately, this membership function leads to an unbounded spatially regular 
fuzzy set (regular open fuzzy set), which is impractical for implementation. We 
can also give a similar definition of a membership function with bounded support: 

$$\mu_{\tilde{A}}(x, y) = \begin{cases} 1 & \text{if } (x, y) \in A \\ 1 - \frac{1}{\lambda}\, d((x, y), B_A) & \text{if } (x, y) \notin A,\ d((x, y), B_A) \le \lambda \\ 0 & \text{otherwise} \end{cases}$$ 

In the same way as the distance from a point outside of A to $B_A$ increases 
to $\lambda$, the degree of membership of this point in the fuzzy region decreases to zero. 
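
The two distance-based membership functions can be tried out with the following Python sketch. Several choices are assumptions made only for illustration: the crisp reference region is a disc (so the distance to its boundary has a closed form), the bounded variant uses a linear decay that reaches zero at distance lam, and all function and parameter names are hypothetical.

```python
import math


def distance_to_boundary(p, center, radius):
    """Distance from point p to the boundary circle of the disc."""
    return abs(math.dist(p, center) - radius)


def inside(p, center, radius):
    """True if p lies in the (closed) crisp disc."""
    return math.dist(p, center) <= radius


def mu_exponential(p, center, radius, a=2.0, lam=1.0):
    """Unbounded variant: membership a**(-lam * d) outside the disc."""
    if inside(p, center, radius):
        return 1.0
    return a ** (-lam * distance_to_boundary(p, center, radius))


def mu_bounded(p, center, radius, lam=2.0):
    """Bounded variant (assumed linear decay): zero beyond distance lam."""
    if inside(p, center, radius):
        return 1.0
    d = distance_to_boundary(p, center, radius)
    return 1.0 - d / lam if d <= lam else 0.0


if __name__ == "__main__":
    center, radius = (0.0, 0.0), 1.0
    for x in (0.5, 2.0, 3.0, 4.0):
        p = (x, 0.0)
        print(x, mu_exponential(p, center, radius), mu_bounded(p, center, radius))
```

The exponential variant never reaches zero, which mirrors the unbounded support noted above, while the bounded variant has support limited to a band of width lam around the disc.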

[Use96] also presents membership functions for smooth fuzzy regions. The 
applications considered are air pollution defined as a fuzzy region with membership 
values based on the distance from a city center and a hill with elevation as the 
controlling value for the membership function. [LAB96] models the transition between 
two smooth regions for soil units with symmetric membership functions. 

A method to design a membership function for a finite-valued region with 
n possible membership values (truth values) is to code the n values by rational 
numbers in the unit interval [0, 1]. For that purpose, we divide the unit interval evenly 
into $n - 1$ subintervals and take their endpoints as membership values. 
We obtain the set $\{\frac{i}{n-1} \mid n \in \mathbb{N},\ 0 \le i \le n - 1\}$ of truth values. Assuming 
that we intend to model air pollution caused by a power station located at point 
$p \in \mathbb{R}^2$, we can define the following (simplified) membership function for n = 5 
degrees of truth representing, for instance, areas of extreme, high, average, low, 
and no pollution ($a, b, c, d \in \mathbb{R}^+$): 

$$\mu_{\tilde{A}}(x, y) = \begin{cases} 1 & \text{if } \mathit{dist}(p, (x, y)) \le a \\ \frac{3}{4} & \text{if } a < \mathit{dist}(p, (x, y)) \le b \\ \frac{1}{2} & \text{if } b < \mathit{dist}(p, (x, y)) \le c \\ \frac{1}{4} & \text{if } c < \mathit{dist}(p, (x, y)) \le d \\ 0 & \text{if } d < \mathit{dist}(p, (x, y)) \end{cases}$$ 
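
As a hedged illustration, the finite-valued scheme can be written down in a few lines of Python. The function names are hypothetical, exact rational truth values are produced with `fractions.Fraction`, and the distance thresholds are assumed to be given in increasing order.

```python
import math
from fractions import Fraction


def truth_values(n):
    """Endpoints of n - 1 equal subintervals of [0, 1], in ascending order."""
    return [Fraction(i, n - 1) for i in range(n)]


def pollution_membership(point, station, thresholds):
    """Finite-valued membership based on the distance to the power station.

    thresholds: increasing distances (a, b, c, d); with four thresholds we
    obtain n = 5 truth values 1, 3/4, 1/2, 1/4, 0.
    """
    levels = truth_values(len(thresholds) + 1)
    dist = math.dist(point, station)
    for k, limit in enumerate(thresholds):
        if dist <= limit:
            # k = 0 selects the highest truth value, k = 1 the next one, ...
            return levels[-1 - k]
    return levels[0]


if __name__ == "__main__":
    station = (0.0, 0.0)
    for x in (0.5, 1.5, 2.5, 3.5, 5.0):
        print(x, pollution_membership((x, 0.0), station, (1.0, 2.0, 3.0, 4.0)))
```

Because the truth values only play a symbolic role in this class, the sketch returns exact fractions rather than floating-point numbers.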



7 Structured Views of Fuzzy Regions 

The formal definition of a fuzzy region given in Section 6.2 is conceptually 
somewhat “structureless” in the sense that only “flat” point sets are considered and 
no structural information is revealed. In the following four subsections some 
“semantically richer” characterizations of fuzzy regions are presented which enable 
a better understanding of fuzzy regions. On the one hand they subdivide fuzzy 
regions into fuzzy components, and on the other hand they describe them as 
collections of crisp regions. Moreover, they give hints for a possible implementation. 



7.1 Fuzzy Regions as Multi-component Objects 

The first structured view considers a fuzzy region as a set of fuzzy components. 
For a definition we need a notion of connectedness for fuzzy regions. A separation 
of a fuzzy region Y is a pair A, B of fuzzy subregions satisfying the following 
four conditions: 

(i) $A \neq \emptyset$, $B \neq \emptyset$ 

(ii) $Y = A \cup_r B$ 

(iii) $A \cap \mathit{int}_f(B) = \emptyset \wedge \mathit{int}_f(A) \cap B = \emptyset$ 

(iv) $|A \cap_r B|$ is finite 
If a separation of Y into A and B exists, then Y is said to be separated, and 
A and B are called disjoint. Otherwise Y is said to be connected. Note that 
condition (iii) of the definition uses the usual fuzzy intersection operation and 
not the one on spatially regular fuzzy sets since the latter requires two fuzzy 
regions as operands. The property of disjointedness (condition (iv)) requires that 
the two fuzzy subregions A and B may share at most a finite number of 
boundary points; this makes sense since otherwise they could simply be merged into 
one fuzzy subregion. We now continue this separation process and decompose 
a fuzzy region Y into its maximal set of pairwise disjoint fuzzy components 
$Y = \{A_1, \dots, A_n\}$ (in the spatial context this decomposition is always finite) so 
that we obtain with $I = \{1, \dots, n\}$: 

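(i) $\forall\, i \in I : A_i \neq \emptyset$ 

(ii) $Y = \bigcup_r{}_{i \in I} A_i$ 

(iii) $\forall\, i, j \in I,\ i \neq j : A_i \cap \mathit{int}_f(A_j) = \emptyset \wedge \mathit{int}_f(A_i) \cap A_j = \emptyset$ 

(iv) $\forall\, i, j \in I,\ i \neq j : |A_i \cap_r A_j|$ is finite 

(v) $\forall\, i \in I : (A_i \text{ is connected} \wedge \nexists\, B \supset A_i : B \text{ is connected})$ 

We call each fuzzy component $A_i$ a fuzzy face. Hence, we obtain: 

A fuzzy region is a set of pairwise disjoint fuzzy faces. 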
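
On a raster approximation, the decomposition into fuzzy faces can be sketched as a connected-component search over the support. This is a simplification made for illustration only: the fuzzy region is assumed to be a grid of membership values, and 4-connectivity is chosen so that cells touching only at a corner (a single shared point) end up in different faces. The function name `fuzzy_faces` is hypothetical.

```python
from collections import deque


def fuzzy_faces(grid):
    """Split the support of a membership grid into 4-connected components.

    Each face is returned as a dict mapping (row, col) to its membership value.
    """
    rows, cols = len(grid), len(grid[0])
    seen = set()
    faces = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] <= 0 or (r, c) in seen:
                continue
            face = {}
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                face[(y, x)] = grid[y][x]
                # Visit the four edge-adjacent neighbours inside the support.
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] > 0 and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            faces.append(face)
    return faces


if __name__ == "__main__":
    grid = [
        [1.0, 0.5, 0.0, 0.0],
        [0.5, 0.0, 0.0, 0.25],
        [0.0, 0.0, 0.25, 0.0],
    ]
    for face in fuzzy_faces(grid):
        print(sorted(face.items()))
```

The two cells with value 0.25 meet only diagonally, so they are reported as separate faces, in line with the requirement that disjoint components may share at most finitely many boundary points.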

The question arises whether fuzzy holes can also be identified from the point set 
view of a fuzzy region. The answer is no. Let us briefly consider 
the crisp case. If A is a crisp region, its faces can have holes which belong 
to the complement (exterior) of A, i.e., to $\mathbb{R}^2 - A$, and are “enclosed” by A. 
Unfortunately, ordinary point set topology offers no method to extract holes 
from a (regular closed) point set as separate components; they are simply part 
of the complement. Note that this does not mean that regions with holes cannot 
be modeled. Some research work shows, for example, 
that this is possible by selecting a constructive approach. Roughly speaking, 
the idea is to assume that the holes of A are already given as regions and to 
subtract these holes from a “generalized region A*” that is isomorphic to a closed 
disc and is the union of A and the holes. But since this is a pure set operation, 
A afterwards “forgets” how it was produced and cannot reconstruct its past. 
Similarly to the crisp case, holes cannot be identified from a (spatially regular) 
fuzzy point set, since fuzzy topology also offers no concept of holes. 

Moreover, we are here faced with the problem of the nature of a fuzzy hole. 
By analogy with the crisp case, we could say that the fuzzy holes of a fuzzy 
region A exclusively contain all points that are enclosed by any fuzzy face of A 
and that have membership grade 0 in A. But then, a fuzzy hole is crisp and a 
subset of the set 

$$H = \{((x, y), 1) \mid (x, y) \in \mathit{supp}(\neg A)\}$$ 

This model of a fuzzy hole is unsatisfactory in the sense that it only deals 
with those points enclosed by A that definitely do not belong to A. It does not 
take into account the complement of those points of A belonging only partially 
to A, i.e., the model does not consider the set 

$$\overline{A} = \{((x, y), m) \mid (x, y) \in \mathit{supp}(A) \wedge m = 1 - \mu_A(x, y)\}$$ 

called the anti-fuzzy region of A. 

One could argue that the points of $\overline{A}$ also belong to the fuzzy holes. And 
indeed, we will take this view. The consequence is that for a fuzzy face there 
exists exactly one fuzzy hole. 

7.2 Fuzzy Regions as Three-Part Crisp Regions 

The second structured view leads to a simplification of an originally smooth 
fuzzy region to a core-boundary region and thus to a change from a quantitative 
to a qualitative perspective. It distinguishes between the kernel, the boundary, 
and the exterior as the three parts of a fuzzy region. For a fuzzy region A, these 
parts are defined as crisp regions (regular closed sets)*: 

$$\mathit{kernel}(A) = \mathit{reg}_c(\{(x, y) \in \mathbb{R}^2 \mid \mu_A(x, y) = 1\})$$ 

$$\mathit{boundary}(A) = \mathit{reg}_c(\{(x, y) \in \mathbb{R}^2 \mid 0 < \mu_A(x, y) < 1\})$$ 

$$\mathit{exterior}(A) = \mathit{reg}_c(\{(x, y) \in \mathbb{R}^2 \mid \mu_A(x, y) = 0\})$$ 

The kernel identifies the part that definitely belongs to A. The exterior deter- 
mines the part that definitely does not belong to A. The indeterminate character 
of A is summarized in the boundary of A in a unified and simplified manner. 
Kernel and boundary can be adjacent with a common border, and kernel and/or 
boundary can be empty. This view corresponds exactly to the already described 
concept of vague regions with its three-valued logic [ES97b]. 

All in all, this view presents only a very coarse and restricted description 
of fuzzy regions since it differentiates only between three parts. The original 
gradation in the membership values of the points of the boundary gets lost. 
The benefit of this view lies in the implementation since efficient representation 
methods and algorithms for crisp regions can be used. 
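
A raster version of the three-part view is easy to sketch in Python. The code below is an illustration under simplifying assumptions: the fuzzy region is a grid of membership values, the crisp regularization step is left out, and the function name `three_part_view` is hypothetical.

```python
def three_part_view(grid):
    """Classify grid cells into kernel, boundary, and exterior cell sets.

    kernel: membership exactly 1; exterior: membership exactly 0;
    boundary: everything strictly in between.
    """
    kernel, boundary, exterior = set(), set(), set()
    for r, row in enumerate(grid):
        for c, m in enumerate(row):
            if not 0.0 <= m <= 1.0:
                raise ValueError("membership values must lie in [0, 1]")
            if m == 1.0:
                kernel.add((r, c))
            elif m == 0.0:
                exterior.add((r, c))
            else:
                boundary.add((r, c))
    return kernel, boundary, exterior


if __name__ == "__main__":
    grid = [
        [0.0, 0.5, 1.0],
        [0.25, 0.5, 0.5],
    ]
    kernel, boundary, exterior = three_part_view(grid)
    print(sorted(kernel), sorted(boundary), sorted(exterior))
```

All gradation inside the boundary set is discarded, which is the information loss described above; in return each of the three parts is an ordinary crisp cell set.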

7.3 Fuzzy Regions as Collections of Crisp $\alpha$-Level Regions 

The third structured view attempts to diminish the drawbacks of the three-part 
view of fuzzy regions and to avoid the great information loss in this 
representation. It describes a fuzzy region in terms of nested $\alpha$-level sets. Let A be a fuzzy 
region. Then we represent a region $A_\alpha$ for an $\alpha \in [0, 1]$ as 

$$A_\alpha = \mathit{reg}_c(\{(x, y) \in \mathbb{R}^2 \mid \mu_A(x, y) \ge \alpha\})$$ 

* Correspondingly, for a crisp set A, the regularization function $\mathit{reg}_c$ is defined as 
$\mathit{reg}_c(A) = \mathit{cl}_T(\mathit{int}_T(A))$ where T is a topology for a universe X and $\mathit{cl}_T$ and $\mathit{int}_T$ 
are the closure and interior operators on a topological space (X, T). 
We call $A_\alpha$ an $\alpha$-level region. Clearly, $A_\alpha$ is a crisp region whose boundary 
is defined by all points with membership value $\alpha$. Note that $A_\alpha$ can have holes. 
The kernel of A, as it has been defined in Section 7.2, is then equal to $A_{1.0}$. A 
property of the $\alpha$-level regions of a fuzzy region is that they are nested, i.e., if 
we select membership values $1 = \alpha_1 > \alpha_2 > \dots > \alpha_n > \alpha_{n+1} = 0$ for some 
$n \in \mathbb{N}$, then 

$$A_{\alpha_1} \subseteq A_{\alpha_2} \subseteq \dots \subseteq A_{\alpha_n} \subseteq A_{\alpha_{n+1}}$$ 

We here describe the finite, discrete case that enables us to model and 
implement finite-valued and interval-based regions. If $\Lambda_A$, the set of membership values occurring in A, is infinite, then there are 
obviously infinitely many $\alpha$-level regions which can only be finitely represented 
within this view if we make a finite selection of values. In the discrete case, 
if $|\Lambda_A| = n + 1$ and we take all these occurring membership values of a fuzzy 
region, we can replace “$\subseteq$” by “$\subset$” in the inclusion relationships above. This 
follows from the fact that for any $p \in A_{\alpha_i} - A_{\alpha_{i-1}}$ with $i \in \{2, \dots, n + 1\}$, 
$\mu_A(p) = \alpha_i$. For the continuous case, we get $\mu_A(p) \in [\alpha_i, \alpha_{i-1})$, which leads to 
interval-based regions. As a result, we obtain: 

A fuzzy region is a (possibly infinite) set of $\alpha$-level regions, i.e., 
$A = \{A_{\alpha_i} \mid 1 \le i \le |\Lambda_A|\}$ with $\alpha_i > \alpha_{i+1} \Rightarrow A_{\alpha_i} \subseteq A_{\alpha_{i+1}}$ for $1 \le i \le |\Lambda_A| - 1$. 

From the implementation perspective, one of the advantages of using (a finite 
collection of) $\alpha$-level sets to describe fuzzy regions is that existing geometric 
data structures and geometric algorithms known from Computational Geometry 
can be applied. 
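
The nesting property is easy to observe on a raster approximation. The following Python sketch is illustrative only: it assumes a grid of membership values, skips the crisp regularization, and uses hypothetical function names.

```python
def alpha_level_region(grid, alpha):
    """Cells whose membership value is at least alpha (no regularization)."""
    return {(r, c)
            for r, row in enumerate(grid)
            for c, m in enumerate(row)
            if m >= alpha}


def nested_levels(grid, alphas):
    """Return (alpha, region) pairs ordered from the highest alpha downwards."""
    return [(a, alpha_level_region(grid, a))
            for a in sorted(set(alphas), reverse=True)]


if __name__ == "__main__":
    grid = [
        [0.0, 0.5, 1.0],
        [0.25, 0.5, 0.5],
    ]
    levels = nested_levels(grid, [1.0, 0.5, 0.25, 0.0])
    for alpha, region in levels:
        print(alpha, sorted(region))
    # Each region is contained in the region of the next lower alpha.
    print(all(hi <= lo for (_, hi), (_, lo) in zip(levels, levels[1:])))
```

Since every level is a plain set of cells, standard crisp data structures can store each one, which reflects the implementation advantage mentioned above.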

7.4 Fuzzy Regions as $\alpha$-Partitions 

The fourth structured view is partially motivated by the previous one and 
describes a fuzzy region as a partition. A partition in the spatial context, called 
a spatial partition [ES97a], is a subdivision of the plane into pairwise disjoint 
(crisp) regions (called blocks) where each block is associated with an attribute 
and where adjacent blocks are not allowed to be labeled with the same attribute. 
It differs from the set-theoretic notion of a partition in the sense that it, of course, 
relates to space and that it incorporates a treatment of common boundary points 
which at the same time may belong to two adjacent blocks. 

From an application point of view, different blocks of a spatial partition 
are often marked differently, i.e., different labels of some set L are assigned to 
different blocks. Thus, in a certain way, L determines the type of a partition. 
This leads to spatial partitions of type L that are functions $\pi : \mathbb{R}^2 \to L$. In most 
cases, partitions are defined only partially, i.e., there are blocks (frequently called 
the exterior of a partition) which have no explicitly assigned labels. To complete 
$\pi$ to a total function, we assume a label $\bot_L$ (called undefined or unknown) for 
each label type L and require that the exterior of a partition is labeled by $\bot_L$. 

As for crisp regions, we also desire regularity for the blocks of a spatial 
partition. We require the interiors of blocks to be regular open sets. Since points 
on the boundary cannot be uniquely assigned to either adjacent block, we cannot 
simply map them to single L-values. Instead, boundary points are mapped to 
the set of values given by the labels of all adjacent blocks. This leads to the 
definition of a spatial mapping of type L as a total mapping $\pi : \mathbb{R}^2 \to L \cup 2^L$. 

The range of a spatial mapping $\pi$ yields the set of labels actually used in $\pi$ and 
is denoted by $\mathit{range}(\pi)$. The blocks of a spatial mapping $\pi$ are point sets that are 
mapped to the same labels. The block for a single label $l$ (or a set $S$ of labels) 
is given* by $\pi^{-1}(l)$ ($\pi^{-1}(S)$). The common label of a block $b$ of $\pi$ is denoted 
by $\pi[b]$, i.e., $\pi(b) = \{l\} \Rightarrow \pi[b] = l$. Obviously, the cardinality of block labels 
identifies different parts of a partition. A region of $\pi$ is any block of $\pi$ that is 
mapped to a single element of L, and a border of $\pi$ is given by a block that is 
mapped to a set of L-values, or formally for a spatial mapping $\pi$ of type L: 

(i) $\rho(\pi) = \pi^{-1}(\mathit{range}(\pi) \cap L)$ (regions) 

(ii) $\beta(\pi) = \pi^{-1}(\mathit{range}(\pi) \cap 2^L)$ (borders) 

Now we can finally define a spatial partition by topologically constraining 
regions to regular open sets and by semantically constraining boundary labels 
to those of adjacent regions. 

A spatial partition of type L is a spatial mapping $\pi$ of type L with 

(i) $\forall\, r \in \rho(\pi)$ : r is a regular open set (i.e., $r = \mathit{int}_T(\mathit{cl}_T(r))$) 

(ii) $\forall\, b \in \beta(\pi) : \pi[b] = \{\pi[r] \mid r \in \rho(\pi) \wedge b \subseteq \mathit{cl}_T(r)\}$ 

The set of all spatial partitions of type L is denoted by [L], i.e., 
$[L] \subseteq \mathbb{R}^2 \to L \cup 2^L$. 

Using the representation based on $\alpha$-level regions defined in the preceding 
subsection, we are now able to define a fuzzy region as a spatial partition. In our 
case $L = \Lambda_A$, i.e., the labels are formed by all possible membership values $\alpha$. We 
now have to determine the different blocks for regions and borders. The regions of 
A are given** by the set $\{\mathit{int}_T(A_{\alpha_i} -_c A_{\alpha_{i-1}}) \mid i \in \{2, \dots, n + 1\}\}$, and the borders 
of A are represented by the set $\{\mathit{bound}_T(A_{\alpha_i} -_c A_{\alpha_{i-1}}) \mid i \in \{2, \dots, n + 1\}\}$. 
The object $A_{\alpha_i} -_c A_{\alpha_{i-1}}$ is a region possibly with holes. Each region is uniquely 
associated with an $\alpha \in \Lambda_A$, and each border has all $\alpha$-labels of adjacent regions. 

A fuzzy region A is a spatial partition of type $\Lambda_A$ (i.e., $A \in [\Lambda_A]$), called 
an $\alpha$-partition. 

If $\Lambda_A$ is infinite, we get an infinite spatial partition. 
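
For the finite, discrete case the blocks of such a partition can be approximated on a raster by taking differences of successive level regions. The Python sketch below rests on several assumptions that are not part of the paper: the fuzzy region is a grid of membership values, borders between blocks are not modeled, the topmost level region is kept as a block of its own, and the function name `alpha_partition` is hypothetical.

```python
def alpha_partition(grid):
    """Map each occurring membership value to the block of cells carrying it.

    Blocks are computed as differences of successive alpha-level regions,
    processed from the highest membership value downwards.
    """
    cells = {(r, c): m
             for r, row in enumerate(grid)
             for c, m in enumerate(row)}
    blocks = {}
    previous = set()
    for alpha in sorted(set(cells.values()), reverse=True):
        level = {cell for cell, m in cells.items() if m >= alpha}
        blocks[alpha] = level - previous  # cells newly reached at this level
        previous = level
    return blocks


if __name__ == "__main__":
    grid = [
        [0.0, 0.5, 1.0],
        [0.25, 0.5, 0.5],
    ]
    for alpha, block in alpha_partition(grid).items():
        print(alpha, sorted(block))
```

The resulting blocks are pairwise disjoint and together cover the whole grid, so each cell carries exactly one label from the set of occurring membership values.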



* We use the following definition of function inverse: for $f : X \to Y$ and $y \in Y$: 
$f^{-1}(y) := \{x \in X \mid f(x) = y\}$. Note that $f^{-1}$ applied to a set yields a set of sets. 

** In the following, the operation “$-_c$” denotes the regular difference operation on 
regular closed sets. The operation $\mathit{bound}_T$ applied to a regular closed set yields its 
point-set topological boundary. 
8 Conclusions and Future Work 

This paper lays the conceptual and formal foundation for the treatment of spatial 
data blurred by the feature of fuzziness. It also contributes to bridging the 
gap between the entity-oriented and field-oriented views of spatial phenomena, 
since the transitions between both views now become increasingly fluid. 
The paper focuses on the design of a type system for fuzzy spatial data and leads 
to three fuzzy spatial data types for fuzzy points, fuzzy lines, and fuzzy regions 
whose structure and semantics are formally defined. The characteristic feature of 
the design is the modeling of smoothness and continuity, which is inherent to the 
objects themselves and to the transitions between different fuzzy objects. This 
is achieved by the framework of fuzzy set theory and fuzzy topology, which allow 
partial and multiple membership and hence different membership degrees of an 
element in sets. Different structured views of fuzzy regions as special collections 
of crisp regions enable us to obtain a better understanding of their nature and 
to decrease their complexity. 

Future work will have to deal with the formal definition of fuzzy spatial 
operations and predicates, with the integration of fuzzy spatial data types into 
query languages, and with implementation aspects leading to sophisticated data 
structures for the types and efficient algorithms for the operations. 
1 Introduction 

Conventional relational databases often do not have the technology required 
to handle complex data like spatial data. Unlike the traditional applications of 
databases, spatial applications require that databases understand complex data 
types like points, lines, and polygons. Typically, operations on these types are 
complex when compared to the operations on simple types. Hence relational 
database systems need to be extended in several areas to facilitate the 
storage and retrieval of spatial data. Several research reports have described the 
requirements for such a database system and prioritized the research needs in this 
area. 

A broad survey of spatial database requirements and an overview of research 
results is provided in […]. Research needed to improve the performance of 
spatial databases in the context of object-relational databases was listed in […]. 
The primary research needs identified were extensible indexing and optimizer, 
concurrency control techniques for spatial indexing methods, development of cost 
models for query processing, and the development of new spatial join algorithms. 
Many of the system requirements identified in […] have since been addressed in 
some commercial systems […]. In this context, we describe our experiences in 
implementing a spatial database on top of Oracle’s extensible architecture. 

1.1 Requirements of a Spatial Database System 

Any database system that attempts to deal with spatial applications has to 
provide the following features: 

— A set of spatial data types to represent the primitive spatial data types 
(point, line, area), complex spatial data types (polygons with holes, collec- 
tions) and operations on these data types like intersection, distance, etc. 

— The spatial types and operations on top of them should be part of the stan- 
dard query language that is used to access and manipulate non spatial data in 
the system. For example, SQL in case of relational database systems should 
be extended to be able to support spatial types and operations. 



R.H. Güting, D. Papadias, F. Lochovsky (Eds.): SSD’99, LNCS 1651, pp. 355-[?], 1999. 
© Springer-Verlag Berlin Heidelberg 1999 

— The system should also provide the performance enhancements that are 
available for non-spatial data, such as indexes to process spatial queries 
(range and join queries), parallel processing, etc. 

2 Oracle8i Spatial 

Oracle8i Spatial [?] provides a completely open, standards-based architecture for 
the management of spatial data within a database management system. Users 
can use the same query language (industry-standard SQL) to access the spatial 
data and all other data in the database. The functionality provided by Oracle8i 
Spatial is completely integrated within the Oracle database server. Users of 
spatial data gain access to standard Oracle8i features, such as a flexible 
client/server architecture, object capabilities, and robust data management 
utilities, ensuring data integrity, recovery, and security features that are 
virtually impossible to obtain with other architectures. Oracle8i Spatial enables 
merging GIS (Geographic Information System) and MIS (Management Information 
System) data stores and implementing a unified data management architecture for 
all data across the enterprise. Oracle8i Spatial provides a scalable, integrated 
solution for managing structured and spatial data inside the Oracle server. 

2.1 Spatial Data Modeling 

Oracle Spatial supports three primitive geometric types and geometries 
composed of collections of these types. The three primitive types are (i) point, 
(ii) line string, and (iii) N-point polygon, all of them in two dimensions. A 
2-D point is an element composed of two ordinates, X and Y. Line strings are 
composed of one or more pairs of points that define line segments. The two 
points of a segment can be connected either by a straight line or by a circular 
arc, so a line string can be composed of straight line segments, arc segments, 
or a mixture of both. Polygons are composed of connected line strings that form 
a closed ring; the interior of the polygon is implied. 
A geometry is the representation of a spatial feature, modeled as a set of 
primitive elements. A geometry can consist of a single element or of a 
homogeneous or heterogeneous collection of primitive types. A layer is a 
collection of geometries that share the same attribute set. For example, one 
layer in a GIS might include topographical features, while another describes 
population density, and a third describes the network of roads and bridges in an 
area. 

2.2 Operations on the Spatial Data Types 

The binary topological relationship between two spatial objects A and B in 
Euclidean space is based on how the two objects interact with respect to their 
interiors, boundaries and exteriors. This is called the 9-intersection model [2] 
for the topological relationships between two objects. In this model, one can 
theoretically distinguish between 2^9 = 512 binary relationships between 
A and B. In the case of 2-dimensional objects, only eight relations can be 
realized, and these provide mutually exclusive and complete coverage for A and 
B. These relationships are contains, coveredby, covers, disjoint, equal, inside, 
overlap, and touch. Oracle Spatial supports this 9-intersection model [2] for 
determining the topological relationships between two objects. In addition, the 
system can support other relationships derived as combinations of the above 
eight relations. For example, OVERLAPBDYDISJOINT can be defined as the relation 
where the objects overlap but the boundaries are disjoint. Oracle Spatial also 
provides a within-distance function where the distances are calculated in 
Euclidean space. In addition, the system provides set-theoretic operations like 
UNION, INTERSECTION, DIFFERENCE and SYMMETRIC-DIFFERENCE. For example, given two 
spatial objects A and B, one can compute and return a new object C which is the 
UNION of A and B. 
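
To make the counting argument concrete, the following Python sketch enumerates the 2^9 possible 9-intersection matrices and names the eight that are realizable between two regions. The matrix patterns are the ones commonly given in the literature for two simple regions; the names REGION_RELATIONS, classify and converse are illustrative assumptions and are not part of Oracle Spatial. 

```python
from itertools import product

# Rows: interior, boundary, exterior of A; columns: the same parts of B.
# An entry is 1 if the two point sets intersect and 0 if they do not.
# Patterns assumed here are those usually quoted for two simple regions.
REGION_RELATIONS = {
    "disjoint":  ((0, 0, 1), (0, 0, 1), (1, 1, 1)),
    "touch":     ((0, 0, 1), (0, 1, 1), (1, 1, 1)),
    "overlap":   ((1, 1, 1), (1, 1, 1), (1, 1, 1)),
    "equal":     ((1, 0, 0), (0, 1, 0), (0, 0, 1)),
    "inside":    ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "coveredby": ((1, 0, 0), (1, 1, 0), (1, 1, 1)),
    "contains":  ((1, 1, 1), (0, 0, 1), (0, 0, 1)),
    "covers":    ((1, 1, 1), (0, 1, 1), (0, 0, 1)),
}


def all_matrices():
    """Yield every possible 3x3 matrix of empty/non-empty intersections."""
    for bits in product((0, 1), repeat=9):
        yield (bits[0:3], bits[3:6], bits[6:9])


def classify(matrix):
    """Return the relation name for a matrix, or None if it is not realizable."""
    for name, pattern in REGION_RELATIONS.items():
        if pattern == matrix:
            return name
    return None


def converse(matrix):
    """Swap the roles of A and B by transposing the matrix."""
    return tuple(zip(*matrix))


if __name__ == "__main__":
    realizable = [m for m in all_matrices() if classify(m) is not None]
    print(len(list(all_matrices())), "matrices,", len(realizable), "realizable")
    print(classify(converse(REGION_RELATIONS["inside"])))
```

Transposing a matrix swaps the roles of A and B, which is why inside/contains and coveredby/covers appear as converse pairs. 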

2.3 SQL Support for Spatial Data 

The query language is the principal interface to the data stored in a relational 
database system. A popular commercial language used for accessing data in 
an RDBMS is SQL. Traditional SQL has recently been extended to 
support access to new data types. In the case of Oracle8i Spatial, SQL is extended 
in two ways: SQL can be used to define and create objects of spatial types, and 
SQL can be used to insert, delete, and update spatial types in addition to 
querying the spatial data with the help of spatial functions. For example, 
all the parks in a city that overlap the rivers in the city can be found using 
the SQL query: 

SELECT A.feature FROM parks A, rivers B 
WHERE sdo_geom.relate(A.geometry, B.geometry, 'OVERLAP') = 'TRUE'; 

2.4 Spatial Indexing 

The introduction of spatial indexing capabilities into the Oracle database engine 
through Oracle Spatial is a key feature. A spatial index acts much like any other 
index: it is a mechanism to limit searches within tables (or data spaces), here 
based on spatial criteria. An index is required to efficiently process queries 
such as finding the objects within a data space that overlap a query area 
(usually defined by a query polygon), or finding pairs of objects from within 
two data spaces that spatially interact with one another (spatial join). 

A spatial index in Spatial cartridge is a logical index. The entries in the 
index are dependent on the location of the geometries in a coordinate space, 
but the index values are in a different domain. Index entries take on values 
from a linearly ordered integer domain, while the coordinates of a geometry may 
be pairs of integer, floating-point, or double-precision numbers. Spatial 
cartridge uses a linear quadtree based indexing scheme, also known as z-ordering, 
which maps geometric objects to a set of numbered tiles. Point data can be 
indexed very well by a recursive decomposition of space. Spatial objects with 
extent, such as area 



358 Siva Ravada and Jayant Sharma 



or line features, create a problem for this sort of index, because they are highly 
likely to cross index cell partition boundaries. Alternative indexing mechanisms, 
such as R-trees, have been proposed based on overlapping index cells (a 
non-hierarchical decomposition). Oracle Spatial takes another approach to 
the problem: each item is allowed multiple entries in the index. This allows one 
to index features with extent by covering them with the decomposition tiles from 
a hierarchical decomposition. 
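
The idea of mapping an object to several numbered tiles can be sketched in a few lines of Python. The sketch makes simplifying assumptions that are not taken from the product: tiles have a single fixed size (one quadtree level), an object is approximated by its bounding box, and the coordinate space is the square [0, extent) x [0, extent). The function names interleave and tiles_for_box are hypothetical. 

```python
def interleave(x: int, y: int, level: int) -> int:
    """Interleave the low `level` bits of x and y into one z-order code."""
    code = 0
    for bit in range(level):
        code |= ((x >> bit) & 1) << (2 * bit)        # x bits go to even positions
        code |= ((y >> bit) & 1) << (2 * bit + 1)    # y bits go to odd positions
    return code


def tiles_for_box(xmin, ymin, xmax, ymax, level, extent=1.0):
    """Return the z-order codes of all fixed-size tiles touched by a bounding box.

    The space [0, extent) x [0, extent) is cut into 2**level cells per axis.
    An object with extent simply receives one index entry per covered tile.
    """
    cells = 1 << level
    size = extent / cells

    def cell(value):
        # Clamp so that boxes touching the border still map to a valid cell.
        return max(0, min(cells - 1, int(value / size)))

    x0, x1 = cell(xmin), cell(xmax)
    y0, y1 = cell(ymin), cell(ymax)
    return sorted(
        interleave(x, y, level)
        for x in range(x0, x1 + 1)
        for y in range(y0, y1 + 1)
    )


if __name__ == "__main__":
    # A small box inside one tile gets one entry; a box that straddles a
    # tile boundary gets one entry for every tile it covers.
    print(tiles_for_box(0.10, 0.10, 0.20, 0.20, level=2))
    print(tiles_for_box(0.20, 0.20, 0.30, 0.30, level=2))
```

Because the codes are plain integers, they can be stored and searched with an ordinary ordered index, which is the property the linear quadtree scheme relies on. 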

Extensible Indexing in Oracle. With Oracle’s extensible indexing framework, 
applications can define the structure and access methods for application-specific 
data. (This is called the domain index in Oracle.) The application can store 
the index data either inside the Oracle database (e.g. in the form of tables) or 
outside the Oracle database (in the form of files). The application can also 
define routines that manage and manipulate the index to evaluate SQL queries. In 
effect, the application controls the structure and semantic content of the 
domain index. The database system interacts with the application to build, 
maintain, and employ the domain index. The main advantage of this extensible 
indexing framework is that the index is always in sync with the data: once the 
index is built, all updates on the base table automatically result in updates 
to the index data. Thus users need not worry about data integrity and 
correctness issues. Once the domain index is built, it is treated like 
a regular B-tree index. The database server knows of the existence of this 
domain index and manages all the index-related work using user-defined functions. 
The extensible indexing framework also provides hooks into the optimizer to 
let the domain index creator educate the optimizer about the cost functions 
and selectivity functions associated with the domain index. The optimizer can 
then generate execution plans that make educated choices regarding domain 
indexes. Oracle Spatial built an indexing mechanism using this extensible 
indexing framework that is completely integrated with the database system. It 
also provides the full concurrency control that is available to non-spatial data 
and B-tree indexes in the database. 
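
The division of labor described above, in which the application supplies the index routines and the server calls them on every change, can be illustrated with a toy Python model. It is only a sketch of the callback pattern: the classes GridDomainIndex and Table and their method names are hypothetical and do not correspond to Oracle's actual interface. 

```python
from collections import defaultdict


class GridDomainIndex:
    """Hypothetical application-defined index: maps grid cells to row ids."""

    def __init__(self, cell_size=10.0):
        self.cell_size = cell_size
        self.cells = defaultdict(set)

    def _cell(self, point):
        x, y = point
        return (int(x // self.cell_size), int(y // self.cell_size))

    def on_insert(self, row_id, point):
        self.cells[self._cell(point)].add(row_id)

    def on_delete(self, row_id, point):
        self.cells[self._cell(point)].discard(row_id)

    def scan(self, point):
        """Return the ids of all rows stored in the same cell as `point`."""
        return set(self.cells.get(self._cell(point), set()))


class Table:
    """Stand-in for the server: it invokes the index routines on every change."""

    def __init__(self):
        self.rows = {}
        self.indexes = []

    def create_index(self, index):
        for row_id, point in self.rows.items():   # index the existing rows
            index.on_insert(row_id, point)
        self.indexes.append(index)

    def insert(self, row_id, point):
        self.rows[row_id] = point
        for index in self.indexes:
            index.on_insert(row_id, point)

    def delete(self, row_id):
        point = self.rows.pop(row_id)
        for index in self.indexes:
            index.on_delete(row_id, point)

    def update(self, row_id, point):
        self.delete(row_id)
        self.insert(row_id, point)


if __name__ == "__main__":
    table = Table()
    table.insert(1, (5.0, 5.0))
    index = GridDomainIndex()
    table.create_index(index)
    table.insert(2, (7.0, 3.0))
    table.update(1, (25.0, 5.0))
    print(index.scan((1.0, 1.0)), index.scan((21.0, 1.0)))
```

The user of Table never touches the index directly, which mirrors the claim that the index stays in sync with the base table without user involvement. 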

2.5 Query Processing 

Queries and data manipulation statements can involve application-specific 
operators, like the Overlaps operator in the spatial domain. Oracle’s extensible 
framework lets applications/users define operators and tie the operators to a 
domain index. This lets the optimizer choose a domain index in evaluating a 
user-defined operator. Oracle Spatial defines operators that are common to many 
spatial applications. Spatial queries are evaluated using the 
popular two-step method: a filter step and a refinement step. A spatial index is 
used during the filter step and the actual geometries are used in the refinement 
step. This two-step process is used in both the window-query case and the spatial 
join case. 

For example, Oracle Spatial provides an SDO_RELATE operator which can 
be used to compute whether two geometries overlap each other. If we want to find 
all the roads through a county where the road intersects the county boundary, 
the query will look like this: 

SELECT A.id FROM roads A, counties B 
WHERE B.name = 'MIDDLESEX' 
AND SDO_RELATE(A.geometry, B.geometry, 'MASK=OVERLAP') = 'TRUE'; 

This query shows a simple example where a non-spatial attribute and a 
spatial attribute are used in the same query. Assume that there is only one row 
in the counties table that satisfies the predicate on the counties.name column. 
The optimizer will then be able to choose a B-tree index on the counties.name 
column and use the spatial index to evaluate the SDO_RELATE operator as a 
window query on the roads table. 
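
The two-step evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's algorithm: geometries are circles (x, y, radius) so that the exact test stays short, the filter uses a fixed grid of tiles over bounding boxes, and all function names are hypothetical. 

```python
import math
from collections import defaultdict


def bbox(circle):
    """Bounding box (xmin, ymin, xmax, ymax) of a circle (x, y, radius)."""
    x, y, r = circle
    return (x - r, y - r, x + r, y + r)


def tiles(box, size):
    """Grid tiles of edge length `size` that a bounding box touches."""
    xmin, ymin, xmax, ymax = box
    return {
        (i, j)
        for i in range(int(xmin // size), int(xmax // size) + 1)
        for j in range(int(ymin // size), int(ymax // size) + 1)
    }


def build_index(objects, size):
    """Tile -> ids of the objects whose bounding box touches that tile."""
    index = defaultdict(set)
    for oid, circle in objects.items():
        for tile in tiles(bbox(circle), size):
            index[tile].add(oid)
    return index


def window_query(objects, index, size, query):
    """Return (candidates, hits) for a circular query window."""
    # Filter step: only the index is consulted, so false positives are possible.
    candidates = set()
    for tile in tiles(bbox(query), size):
        candidates |= index.get(tile, set())
    # Refinement step: the exact geometry test removes the false positives.
    qx, qy, qr = query
    hits = {
        oid
        for oid in candidates
        if math.hypot(objects[oid][0] - qx, objects[oid][1] - qy)
        <= objects[oid][2] + qr
    }
    return candidates, hits


if __name__ == "__main__":
    objects = {"a": (1.0, 1.0, 1.0), "b": (8.5, 8.5, 1.0), "c": (30.0, 30.0, 1.0)}
    index = build_index(objects, size=10.0)
    print(window_query(objects, index, 10.0, (7.0, 7.0, 1.5)))
```

In the example, objects "a" and "b" share a tile with the query window and pass the filter, but only "b" survives the refinement step. 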

3 Conclusions 

In this paper, we described our experiences in implementing a spatial database 
on top of Oracle’s extensible framework. We described how the query language, 
data modeling and query processing issues are addressed in this system. 
However, more research is still required in areas like partitioning techniques 
to support parallel query processing and bulk loading, and spatial clustering. In 
addition, there is a growing need for an industry-wide benchmark for measuring 
the performance of different database systems supporting spatial databases. 

1 Introduction 

In this paper we discuss the major issues we faced during the design, 
prototyping and implementation of the “Sistema di Interscambio Catasto-Comuni” 
(SICC), the system for the exchange of Italian cadastral data among the main 
organizations involved in the treatment of cadastral information in Italy, 
namely the Ministry of Finance, Municipalities, Notaries, and Certified Land 
Surveyors. 

The definition and design phases, conducted with the direct involvement 
of all communities interested in cadastral data, allowed us to identify a new 
and promising approach, namely the access keys warehouse, for the realization of 
large distributed spatial applications that have the absolute requirement of 
integrating legacy spatial databases. 

2 Starting Point and Objectives 

Cadaster, a Department of the Italian Ministry of Finance, is the public registry 
of real estates and land properties. It was established for fiscal purposes. 
The main key to access cadastral information concerning real estates and land 
properties is a unique cadastral identification code, made 
up of Municipality code, map sheet number, parcel number and flat number. 

A Municipality has the objective of planning and managing land use. For 
this purpose it mainly uses two keys: the cadastral identification code, as 
defined above, and the property location, expressed in terms of street, civic 
number, and flat number. 

* M. Talamo is the CEO of the Initiative for the Development of an IT Infrastructure 
for Inter-Organization Cooperation (namely, “Coordinamento dei Progetti 
Intersettoriali”) of AIPA - “Autorità per l’Informatica nella Pubblica 
Amministrazione”, http://www.aipa.it. 

Cadaster data is managed by the Land Department of the Ministry of Finance 
through its Land Offices (“Uffici del Territorio”), which are present at the 
level of Provinces (one Office for each Province). Provinces are a subdivision of 
the Regions, the main administrative partition of Italy, and an aggregation of 
Municipalities. 

The Ministry of Finance and Municipalities use cadastral data for taxation 
of properties. According to Italian law, taxes on real estates and land properties 
have to be based on their cadastral value (“rendita catastale”), computed on 
the basis of a number of objective parameters depending on the property 
characteristics. General parameters that are always used are the size and the 
location of the property. Once the values of such parameters are known, the 
cadastral value is computed automatically. The Ministry of Finance is currently 
evaluating a proposal to also introduce parameters related to the market value 
of the property. 

Furthermore, through its Estate Public Registry Offices (“Conservatorie 
Immobiliari”), the Ministry of Finance also keeps records of and certifies 
ownership rights and mortgage rights relative to properties. 

Municipalities also have their databases about real estates and land properties. 
These are used, as set by the law, to support and manage actions in the sectors 
of Toponymy, Local Fiscality, Public Works, and Land Management. 

The size of the databases managed by Municipalities varies widely, considering 
that about 6,000 of the 8,102 Italian municipalities have fewer than 5,000 
citizens, but 8 of the 20 Region chief towns have more than one million 
inhabitants. 

It is clear that there is a continuous exchange flow of cadastral data among 
Municipalities, Ministry of Finance, Notaries and Certified Land Surveyors. 
Currently the exchange of cadastral information takes place mostly on paper. 

Note also that cadastral databases are not managed at a single central location 
but at the more than 100 Land Offices of the Ministry of Finance. This 
means that there is not a single centralized system, but more than 100 systems, 
geographically distributed over the whole Italian territory. 

The cadastral databases contain about 300,000 maps, approximately 
one third of which are in electronic form, and about 1.5 million geodetic 
reference points. These maps (“Catasto Terreni”) are the geodetic reference for 
land parcels and for the planimetry of the buildings possibly existing within 
land parcels. The planimetry of the various flats inside a building is recorded, 
together with other descriptive data, in “Catasto Fabbricati” and has no direct 
geodetic reference. 

Typical inquiries on cadastral databases produce the certificates needed by 
notaries in all sale acts, and buyers pay a fee to obtain them from the Cadaster. 
Usually the certificate is about location and cadastral value. Additionally, 
topographic and geodetic information (for land parcels) or planimetry (for real 
estates) may be requested. 

Note that data of geometric type are often required during sale transactions 
to check whether the current situation of the land property/real estate is 
consistent with the situation recorded in the cadastral databases. 

Every year in Italy there are about 1.5 million requests for cadastral 
certificates, and in one of the largest provinces there are about 100,000 
requests per year. 

Updates to cadastral data bases are requests to change, for a given real estate/ 
land parcel, some piece of information of geometric nature or of descriptive 
nature. They can be related to the creation of a new building (with its flats) in 
a land parcel or to the variation of an existing building or flat or to the change 
of some descriptive data (e.g., the owner). 

The number of yearly geometric updates to cadastral databases is about 
250,000. These updates always trigger further updates, since a change affects 
one or more of the following aspects: property rights and mortgage rights, fiscal 
nature and parameter values, destination and allowable usage. 

In 1995, the situation was the following: 

— cadastral data recorded in the Cadaster are not, in general, up to date with 
cadastral data recorded in Municipalities, and neither, in general, exactly 
describes the situation in reality. It has been estimated through a 
sample of 5% of the overall data that about 40% of the whole set of data 
held in cadastral databases is not up to date with reality. Please note 
that this refers to the overall set of data, including both data generated by 
the cadastral offices and data received from the outside. Data generated 
inside cadastral offices are usually up to date, hence the Cadaster is able to 
perform its main role. The greatest source of incoherence is in data received 
from the outside, and the consequence of this is a great difficulty in 
establishing a reliable correlation between a situation represented in the 
cadastral databases and a situation represented in databases of other 
organizations. 

— the ways cadastral data change as a consequence of actions taken by 
municipalities, on one side, and by the Ministry of Finance, on the other side, 
are usually different. This is the main reason for the lack of correlation 
between data held by Municipalities and data held by the Ministry of Finance, 
notwithstanding the large efforts that are periodically made to force the 
correlation. It has been estimated that about 10% of the overall data changes 
every year. 

To deal with coherence maintenance issues in cadastral data exchange, as 
required by the law [?], AIPA started the SICC project [2,3,4,5] in 1995, with 
the participation of the Ministry of Finance and ANCI, the association of 
Italian municipalities. 

The overall objective of the SICC project was to provide technical tools to 
overcome this situation without affecting the current relations and kinds of 
interaction among the interested entities and without changing their internal 
work organization. 

At the same time, the law required an organizational change towards a 
decentralization of activities. This nevertheless had to be implemented in such 
a way as to keep the roles of validation and high-level control of 
cadastral information at the center. 

3 Our Solution 

It is clearly not possible to proceed in such a situation with an approach based 
on building new spatial databases, even if the database is distributed and based 
on a “federation” concept. In fact, this would require a huge amount of resources 
and would not satisfy the need to keep the current system working for the 
everyday needs of citizens. 

Also, an approach based on the usual “data warehouse” concept would not 
be adequate, given the highly dynamic nature of the data and the strong emphasis 
on the certification purposes of the overall system. 

Hence we defined and used in the SICC project the concept of Access Keys 
Warehouse. This approach is conceptually simple but has proved highly effective 
in the SICC project. 

With this approach, a data repository containing all data items that can be 
found in the various databases of a distributed system is set up only from a 
virtual point of view, while the data items remain at their physical locations. 

An Access Keys Warehouse is then made up of two main components (for more 
details see [?]): 

— An exchange identifier database, which is physically built and contains 
access keys and logical links for data items in the various databases of the 
distributed system. The access keys are attribute names, selected from the 
existing attributes in the underlying databases: the main rule for 
selecting them is that their concatenation constitutes a unique identifier for 
the data item. Logical links provide the access paths to the physical 
(distributed) databases where further data elements about the identified data 
can be found. 

Note that attributes in the exchange identifier database act towards legacy 
systems as access keys: their values are used to query legacy systems. Hence 
they are not physical pointers, and the legacy systems maintain their 
independence and transparency with respect to both location and implementation. 

The exchange identifier database is populated using data existing in the 
various distributed locations. 

— A coherence manager, which is triggered by updates occurring in the various 
databases of the distributed system. It activates data and control flows 
towards the distributed locations so that the various databases can be kept up 
to date as a consequence of a change made to a data item in a specific 
location. 

The use of the Access Keys Warehouse concept fully supports the SICC project 
targets, since it allows the various distributed databases to be progressively 
synchronized. This increase in database correlation then means that data 
manipulation can be more and more decentralized towards municipalities while 
keeping a central high-level control. 
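
The interplay of the two components can be illustrated with a small Python sketch. It is a toy model under our own assumptions, not the SICC implementation: the class names LegacySystem and AccessKeysWarehouse, the method names, and the sample key and attribute values are all hypothetical. The key is a tuple shaped like the cadastral identification code described earlier (Municipality code, map sheet, parcel, flat). 

```python
from dataclasses import dataclass, field


@dataclass
class LegacySystem:
    """A legacy database that keeps its own data and is queried by key value."""
    name: str
    records: dict = field(default_factory=dict)   # key tuple -> attributes
    pending: list = field(default_factory=list)   # change notices received

    def query(self, key):
        return self.records.get(key)


class AccessKeysWarehouse:
    """Exchange identifier database plus a very simple coherence manager."""

    def __init__(self):
        self.links = {}     # key tuple -> names of systems holding the item
        self.systems = {}   # system name -> LegacySystem

    def register(self, system):
        """Populate the exchange identifiers from data the system already holds."""
        self.systems[system.name] = system
        for key in system.records:
            self.links.setdefault(key, set()).add(system.name)

    def lookup(self, key):
        """Follow the logical links; the data itself stays in the legacy systems."""
        return {
            name: self.systems[name].query(key)
            for name in sorted(self.links.get(key, ()))
        }

    def notify_update(self, source, key, changes):
        """Record a change made at `source` and notify every other holder."""
        self.systems[source].records.setdefault(key, {}).update(changes)
        self.links.setdefault(key, set()).add(source)
        notified = []
        for name in sorted(self.links[key] - {source}):
            self.systems[name].pending.append((key, changes))
            notified.append(name)
        return notified


if __name__ == "__main__":
    key = ("X000", 12, 345, 2)   # made-up identification code
    cadaster = LegacySystem("cadaster", {key: {"value": 1000}})
    town = LegacySystem("municipality", {key: {"street": "Via Esempio 1"}})
    warehouse = AccessKeysWarehouse()
    warehouse.register(cadaster)
    warehouse.register(town)
    print(warehouse.lookup(key))
    print(warehouse.notify_update("cadaster", key, {"value": 1200}))
```

Only keys and links are stored centrally; the notified system decides how to apply a change, which is how the legacy systems keep their independence in this model. 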

The first prototype of the SICC project was implemented in 1995 by AIPA and 
the Italian National Research Council. This prototype proved the feasibility of 
the technical solution and of the organizational model proposed. 

Then SOGEI, the Italian company managing the computerized information 
system of the Ministry of Finance, developed a second prototype, with a better 
degree of integration among cadastral data and services. This prototype was 
put into operation in peripheral offices of the Naples municipality in May 
1997. 

It was subsequently validated for about one year, through the involvement of 
about 100 Municipalities ranging from Region chief towns to very small ones and 
of a small sample of notaries and certified land surveyors. 

Finally, in September 1998 the engineered system, named SISTER [?] and also 
developed by SOGEI, was put into nationwide operation. 

Access to the system is through a Web-based interface, and the effectiveness 
of its use is demonstrated by the sharp increase in the requests managed by it 
during the first months. In January 1999 alone there were already more than 
100,000 cadastral certification queries. Recall that such a query is usually 
paid for by its final user. 

The final phase of the whole project is running in 1999 and aims at extending 
the range of services provided to end users. 

It is also planned to use the Access Keys Warehouse approach in the 
implementation of other large distributed applications for the Italian Public 
Administration that are currently in the definition phase at AIPA. 

Abstract. Traffic telematics comprises services like traffic information, 
navigation, and emergency call services. Providers of such services need large 
spatial databases, especially for the representation of detailed street networks. 
However, the implementation of the services requires functionality that 
cannot be offered by a state-of-the-art spatial database system. In this paper, 
some typical requirements concerning the determination of a vehicle position, 
the traffic-dependent computation of routes, and the management of moving 
spatial objects are discussed with respect to spatial database systems. It will 
be shown that these tasks demand the capability to handle spatio-temporal data 
and queries. 



1 Introduction 

One of the most challenging and encouraging applications of state-of-the-art 
technology is the field of traffic telematics. In the last few years, several new 
companies or new divisions have been founded in order to establish services 
concerning the collection, processing and transmission of data and information 
with respect to road traffic.¹ This process was set off (1) by the urgent need 
of road users for information and assistance and (2) by the development of new 
technologies: traffic telematics would not be possible without the availability 
of mobile (voice and data) communication techniques (e.g. the GSM standard 
including the short message service SMS) [3] and without satellite location 
(especially GPS). Typical services around traffic telematics are the following [4]: 

• traffic information services via (mobile) telephones or special car terminals, 

• on-board and off-board navigation services, 

• breakdown and emergency call services, 



¹ In Germany, some of the companies working in the field of traffic telematics are the 
Mannesmann Autocom GmbH (http://www.passo.de), the Tegaron Telematics 
GmbH (http://www.tegaron.de), the DDG Gesellschaft für Verkehrsdaten GmbH, 
and the automobile club ADAC (http://www.adac.de). Also, most of the car 
manufacturers have founded special divisions or have commissioned subsidiaries in order to 
establish such services. 

R.H. Güting, D. Papadias, F. Lochovsky (Eds.): SSD’99, LNCS 1651, pp. 365-369, 1999. 

© Springer-Verlag Berlin Heidelberg 1999 








Thomas Brinkhoff 



• information and booking services, and 

• fleet services. 

In order to offer such services to customers, the service center of a provider needs the 
access to a large heterogeneous database storing alphanumerical data like customer 
records as well as spatial data. These spatial data consist of different vector and raster 
data sets: examples are administrative areas, postal code areas, and detailed 
topographical raster maps. Most important is the very detailed representation of the 
street network. Such street data are typically delivered according to the European 
GDF standard (“geographic data file” [2]) containing information like road names, 
house numbers, points of interest, and traffic restrictions. 

Another typical aspect of the spatial database is the change of data: some parts are 
relatively static (e.g. the street network) whereas other parts are permanently 
changing. For example, the information about traffic jams and the average speed on a 
road may change at intervals of only a few minutes. A special motivation for using 
database systems in order to store changing or volatile data is the desired robustness 
of the services: in a multi-computer and multi-process environment, the services can 
be synchronized best using a central database. 

In the rest of the paper, some typical requirements are presented concerning spatial 
database systems in order to implement services around traffic telematics. Several of 
these requirements cannot be satisfied by state-of-the-art database systems. 



2 Determining the Location of a Vehicle 

In order to implement a breakdown and emergency call service, it is necessary to 
determine automatically the location of the customer’s vehicle and to derive the road 
name and the house number interval (or the names of crossing roads). But also for 
other services it is essential to locate a vehicle. Examples are off-board navigation 
services (see section 3), where the actual position of the vehicle is automatically used 
as starting point of the route computation, and the "floating car data" technique 
(FCD) [5] where speed changes combined with the location of the vehicle are used 
for the detection of traffic jams. 

The actual position of the vehicle which has been transmitted to the service center 
is used for locating the vehicle. That sounds easy. However, the position delivered by 
GPS is imprecise because of jamming and/or because of interference caused by 
buildings. In some cases, the accuracy is only several hundred meters. In order to 
solve this problem, not only the actual position of the vehicle but also additional 
positions are transmitted to the service center [5]. At these positions, the vehicle 
performed special maneuvers. In addition, the accuracy of the positions, the distance 
which the vehicle drove between the positions, and the direction of the car at the 
positions (incl. accuracy) are transmitted. These data are called a string of pearls; an 
example is depicted in figure 1. 

To locate a vehicle means to determine the road segment(s) where the vehicle 
could be, including a probability of the location. For performing such a computation, 
it is necessary 

• to handle imprecise positions, 

• to handle (imprecise) conditions between imprecise positions, and 

• to take the underlying network including (time-dependent) traffic restrictions into 
account. 




Fig. 1. String of pearls: seven pearls where the oldest pearl has number “1” and the most 
recent pearl has number “7”. The sectors of the pearls represent the direction of the 
vehicle including the possible error. The order of the pearls is illustrated by the line which 
connects the pearls. The small cross shows the real position of the vehicle. In the background, 
the street network is depicted; the symbols represent traffic restrictions. 



The computation of the location can be done outside of the database system by a 
program using standard spatial queries. For performance and maintenance 
reasons, it seems reasonable to locate the vehicle by a query which can be 
performed and optimized by a spatial database system. However, state-of-the-art 
database systems are not able to perform such queries. 
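
To indicate what such a computation involves even in its simplest form, the following Python sketch scores road segments against a string of pearls. It is a deliberately reduced illustration under our own assumptions: each pearl is only a position with an accuracy radius, road segments are straight lines, and the score is the fraction of pearls compatible with a segment, which is a crude stand-in for a location probability. Driving direction, driven distance and traffic restrictions are ignored, and the function names are hypothetical. 

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length2 = dx * dx + dy * dy
    if length2 == 0:
        t = 0.0
    else:
        # Parameter of the orthogonal projection, clamped to the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def candidate_segments(pearls, segments):
    """Score each road segment by the share of pearls that may lie on it.

    pearls:   list of ((x, y), accuracy_radius), oldest first
    segments: dict of segment id -> ((x1, y1), (x2, y2))
    """
    scores = {}
    for seg_id, (a, b) in segments.items():
        compatible = sum(
            1 for position, accuracy in pearls
            if point_segment_distance(position, a, b) <= accuracy
        )
        if compatible:
            scores[seg_id] = compatible / len(pearls)
    return scores


if __name__ == "__main__":
    segments = {
        "main": ((0.0, 0.0), (100.0, 0.0)),
        "side": ((50.0, 0.0), (50.0, 100.0)),
    }
    pearls = [((10.0, 5.0), 10.0), ((40.0, -3.0), 10.0), ((52.0, 20.0), 10.0)]
    print(candidate_segments(pearls, segments))
```

Even this reduced version needs a distance test per pearl and per segment, which hints at why an index-supported, optimizable query inside the database system is attractive. 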



3 Traffic-Dependent Route Computation 

On-board navigation terminals, which compute the shortest route and 
navigate the driver to the destination, are state of the art. The map is generally stored 
in the car on a compact disk. However, these terminals do not consider the actual 
traffic situation. A route computation which takes the traffic situation into 
consideration can be performed best in a service center where the (complete) traffic 
situation is known and where the current street network and the current traffic 
restrictions are maintained. Such an approach is called off-board navigation. 

The computation of a route which takes the current traffic situation into account is 
time-dependent. In order to compute the weight of a road segment used by the routing 
algorithm, (1) the time when this road segment is expected to be passed must be 
determined and (2) the expected speed on this road segment at the expected time must 
be computed. For performing such a spatio-temporal query, the database system must 
be able: 

• to handle time, 

• to perform updates efficiently, 

• to compute routes efficiently, and 

• to handle different times in one query depending on the route computed in the 
same query. 

The last requirement is especially hard to fulfill. 
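
The interplay of steps (1) and (2) can be illustrated by a minimal sketch of a time-dependent shortest-path search. This is not the method of any particular system: the function names, the in-memory graph, and the speed function `speed_at` are hypothetical and chosen for illustration only. The sketch further assumes that a vehicle entering a segment later never leaves it earlier (the so-called FIFO property); without this assumption, the simple label-setting search below is not guaranteed to find the fastest route.

```python
import heapq

def time_dependent_route(graph, speed_at, source, target, depart_time):
    """Earliest-arrival search on a network with time-dependent speeds.

    graph:    {node: [(neighbor, length_in_m), ...]}   (hypothetical layout)
    speed_at: function (u, v, t) -> expected speed in m/s on segment (u, v)
              when it is entered at time t (hypothetical traffic model)
    Returns (arrival_time, path), or (None, []) if target is unreachable.
    """
    earliest = {source: depart_time}
    predecessor = {}
    heap = [(depart_time, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > earliest.get(u, float("inf")):
            continue  # outdated heap entry
        if u == target:
            path = [u]
            while path[-1] != source:
                path.append(predecessor[path[-1]])
            return t, path[::-1]
        for v, length in graph.get(u, []):
            # (1) t is the time at which segment (u, v) is expected to be entered;
            # (2) the expected speed is looked up for exactly that time.
            speed = speed_at(u, v, t)
            if speed <= 0:
                continue  # segment closed at time t
            arrival = t + length / speed
            if arrival < earliest.get(v, float("inf")):
                earliest[v] = arrival
                predecessor[v] = u
                heapq.heappush(heap, (arrival, v))
    return None, []

# Toy network: a short route A-B-D and a longer detour A-C-D (lengths in m).
demo_graph = {
    "A": [("B", 1000.0), ("C", 1500.0)],
    "B": [("D", 1000.0)],
    "C": [("D", 1500.0)],
}

def demo_speed(u, v, t):
    # Assumed traffic model: a jam on segment B-D from t = 100 s onwards.
    if (u, v) == ("B", "D") and t >= 100:
        return 2.0
    return 20.0

early = time_dependent_route(demo_graph, demo_speed, "A", "D", 0)
late = time_dependent_route(demo_graph, demo_speed, "A", "D", 60)
```

With a departure at time 0, segment B-D is passed before the jam begins and the short route is chosen; with a departure at time 60, the vehicle would reach B during the jam and the detour via C becomes faster. The weight of a segment thus depends on the route prefix computed in the same query, which is the difficulty named in the last requirement.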



4 Moving Objects 

For fleet services, it is necessary to keep track of the positions of the vehicles. 
Other (future) traffic telematics services such as toll charging via GPS/GSM, theft 
protection, and the integration of individual and public transport also need the 
management of large sets of moving spatial objects. 

The support of motion requires that the spatial database system stores moving 
objects efficiently and supports spatial queries with time conditions with respect to the 
current time, the past and, especially, the (near) future. The 
larger the number of moving objects, the more important performance issues will 
become. 
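
As a hedged illustration of a spatial query with a time condition in the (near) future, the following sketch stores each object with its last reported position and velocity and answers a window query for a given point in time. The class, its fields, and the linear extrapolation of the motion are assumptions made for this example only; a real system would additionally need an access structure instead of the linear scan used here.

```python
from dataclasses import dataclass

@dataclass
class MovingObject:
    # Hypothetical record: last reported position (m), velocity (m/s)
    # and the time t0 (s) of that report.
    obj_id: str
    x: float
    y: float
    vx: float
    vy: float
    t0: float

    def position_at(self, t):
        """Linearly extrapolated position at time t (assumed motion model)."""
        dt = t - self.t0
        return (self.x + self.vx * dt, self.y + self.vy * dt)

def objects_in_window(objects, window, t):
    """Return the ids of all objects expected inside window at time t.

    window is (xmin, ymin, xmax, ymax); every object is tested in turn.
    """
    xmin, ymin, xmax, ymax = window
    result = []
    for obj in objects:
        px, py = obj.position_at(t)
        if xmin <= px <= xmax and ymin <= py <= ymax:
            result.append(obj.obj_id)
    return result

fleet = [
    MovingObject("taxi-1", 0.0, 0.0, 10.0, 0.0, 0.0),  # driving east at 10 m/s
    MovingObject("bus-7", 500.0, 0.0, 0.0, 0.0, 0.0),  # waiting at a stop
]
window = (400.0, -50.0, 600.0, 50.0)
now = objects_in_window(fleet, window, 0.0)
soon = objects_in_window(fleet, window, 50.0)
```

Here `now` contains only the waiting bus, while `soon` also contains the taxi, which is expected to have entered the window 50 seconds later. The linear model ignores the street network and the traffic situation, which is exactly where the factors discussed in this section come into play.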

The motion of vehicles (or other means of transport) is influenced and restricted by 
different factors: 

- the street network: the street network channels the traffic. In consequence, no 
traffic can be observed outside of the network, and most vehicles mainly use 
a small part of the network, i.e. they drive on major roads and motorways. This 
observation is also valid for public transport systems. 

- the traffic: during rush hour or in traffic jams, the average speed decreases. For 
example, this effect is used by the FCD technique in order to detect traffic jams. 

- time schedules: the motion in public transport systems but also - on special parts of 
the network (e.g. ferry lines) - of individual vehicles is controlled by time 
schedules. 

- other conditions: weather, time of day, day of the week, holiday periods, etc. influence 
the average and the individual behavior of the vehicles. 

For storing moving vehicles in a spatial database system, these factors have to be 
taken into consideration. First, the type of motion is correlated with performance issues: 
in order to support the storage of moving objects, the database system has to offer 
adequate data models and efficient access structures. For example, the structure of the 
network has to be taken into account in order to achieve high performance. Second, 
the motion influences query processing, especially if queries with respect to 
the (near) future are considered. This also affects the design of query languages. 
Furthermore, the integration of a knowledge base storing rules about the typical 
movement of vehicles could be useful in order to answer queries about moving 
vehicles. 

These requirements illustrate that state-of-the-art (spatial) database systems need 
considerable improvements for an adequate support of moving objects. 



5 Conclusions 

Several requirements resulting from the field of traffic telematics were presented in 
this paper. Some of them can be met by a state-of-the-art database system (e.g. the 
storage of imprecise positions); others are topics of current research activities (e.g. the 
CHOROCHRONOS project for spatio-temporal database systems [1]). For the 
efficient management of moving objects, for example, many questions remain unsolved 
[6]. The use of a spatio-temporal database system for the computation of 
traffic-dependent routes (see section 3) also seems to be an unsolved problem. 

A special problem is the integration of different techniques and solutions into one 
database system. For maintenance and operational reasons, there is an urgent need to 
use one standardized (spatial) database system to store the spatial data and implement 
the applications which are used by the different services. The use of several 
special-purpose systems and algorithms for fulfilling the requirements presented in this paper 
is extremely expensive (with respect to time and money). This problem will be 
aggravated by the expected transition of the static street network (current update rate: 
quarterly) into a permanently updated network. In this case, an on-line update will be 
required, which makes the maintenance of several databases and special-purpose 
systems more difficult and more expensive. 